This bike sharing dataset (hour.csv) was obtained from the UCI Machine Learning Repository. The information below is extracted and adapted from the included “Readme.txt”:


Background


Bike sharing systems are a new generation of traditional bike rentals in which the whole process of membership, rental and return has become automatic. Through these systems, a user can easily rent a bike at one location and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising over 500 thousand bicycles. Today, there is great interest in these systems because of their important role in traffic, environmental and health issues.


Dataset


The bike-sharing rental process is highly correlated with environmental and seasonal settings. For instance, weather conditions, precipitation, day of the week, season and hour of the day can all affect rental behavior. The core data set is the two-year historical log for 2011 and 2012 from the Capital Bikeshare system, Washington D.C.


Dataset Characteristics


hour.csv

- instant: record index
- dteday : date
- season : season (1: spring, 2: summer, 3: fall, 4: winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit :
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of bikes rented by casual users
- registered: count of bikes rented by registered users
- cnt: count of total rental bikes including both casual and registered
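As a quick sanity check on the column descriptions above, the sketch below rebuilds the first six rows of hour.csv (a subset of columns, typed in by hand from the `head()` output shown further down) and rescales the normalized columns using the divisors listed above. The `*_raw` column names and the `counts_consistent` flag are introduced here for illustration only.

```r
# First six rows of hour.csv (subset of columns), copied from the head() output.
rows <- data.frame(
  temp       = c(0.24, 0.22, 0.22, 0.24, 0.24, 0.24),
  atemp      = c(0.2879, 0.2727, 0.2727, 0.2879, 0.2879, 0.2576),
  hum        = c(0.81, 0.80, 0.80, 0.75, 0.75, 0.75),
  windspeed  = c(0, 0, 0, 0, 0, 0.0896),
  casual     = c(3L, 8L, 5L, 3L, 0L, 0L),
  registered = c(13L, 32L, 27L, 10L, 1L, 1L),
  cnt        = c(16L, 40L, 32L, 13L, 1L, 1L)
)

# Undo the normalization using the divisors from the column descriptions.
rows$temp_raw      <- rows$temp * 41       # degrees Celsius
rows$atemp_raw     <- rows$atemp * 50      # degrees Celsius
rows$hum_raw       <- rows$hum * 100       # percent
rows$windspeed_raw <- rows$windspeed * 67

# cnt is described as the sum of casual and registered rentals.
counts_consistent <- all(rows$casual + rows$registered == rows$cnt)

print(rows[, c("temp_raw", "atemp_raw", "hum_raw", "windspeed_raw")])
print(counts_consistent)
```

The models below work on the normalized columns directly; the rescaled values are only useful for reading the data.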

Objective


Prediction of hourly bike rental count based on the environmental and seasonal settings.

Let’s read the input and take a look at its structure.

# Read
hour.data <- read.csv("hour.csv", header= TRUE,stringsAsFactors = FALSE)

# Overview
str(hour.data)
## 'data.frame':    17379 obs. of  17 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
head(hour.data)
##   instant     dteday season yr mnth hr holiday weekday workingday
## 1       1 2011-01-01      1  0    1  0       0       6          0
## 2       2 2011-01-01      1  0    1  1       0       6          0
## 3       3 2011-01-01      1  0    1  2       0       6          0
## 4       4 2011-01-01      1  0    1  3       0       6          0
## 5       5 2011-01-01      1  0    1  4       0       6          0
## 6       6 2011-01-01      1  0    1  5       0       6          0
##   weathersit temp  atemp  hum windspeed casual registered cnt
## 1          1 0.24 0.2879 0.81    0.0000      3         13  16
## 2          1 0.22 0.2727 0.80    0.0000      8         32  40
## 3          1 0.22 0.2727 0.80    0.0000      5         27  32
## 4          1 0.24 0.2879 0.75    0.0000      3         10  13
## 5          1 0.24 0.2879 0.75    0.0000      0          1   1
## 6          2 0.24 0.2576 0.75    0.0896      0          1   1

Split the data into training and testing datasets for applying the models: days 1 to 21 of each month go to training, and the remaining days go to testing.

train <- hour.data[as.integer(substr(hour.data$dteday,9,10)) < 22, ]
test <- hour.data[as.integer(substr(hour.data$dteday,9,10)) > 21, ]

# Training: 69.2% 
nrow(train)/ nrow(hour.data)
## [1] 0.6924449
# Testing: 30.8%
nrow(test)/ nrow(hour.data)
## [1] 0.3075551
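The split above keys on the day of the month, read from characters 9 and 10 of the date string. The sketch below repeats the same logic on a handful of made-up dates to show that the two masks partition the rows; `toy`, `day` and `is_train` are illustrative names, not objects from the analysis.

```r
# Toy dates (hypothetical) covering both sides of the day-21/22 boundary.
toy <- data.frame(
  dteday = c("2011-01-01", "2011-01-21", "2011-01-22", "2011-01-31", "2012-02-29"),
  stringsAsFactors = FALSE
)

# Characters 9-10 of "YYYY-MM-DD" hold the day of the month.
day <- as.integer(substr(toy$dteday, 9, 10))

# Days 1-21 go to training; everything else goes to testing.
is_train  <- day < 22
toy_train <- toy[is_train, , drop = FALSE]
toy_test  <- toy[!is_train, , drop = FALSE]

print(toy_train$dteday)
print(toy_test$dteday)
```

Because every month contributes its last 7 to 10 days to the test set, both sets cover all seasons and both years, and the training share stays close to 21 days out of an average month, which is consistent with the roughly 69% reported above.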

Model Selection


GOAL: Apply different models to predict the hourly total rental count

Technical explanation: Each model is built on the training dataset and then used to predict values for the testing dataset, which are compared against the actual values. The measure used here is mean((y - yhat)^2), i.e. Mean Squared Error (MSE). We apply multiple models and look for the one that minimizes MSE. The total rental count (cnt) is the sum of rentals by registered users (registered) and casual users (casual). After trying different combinations, we found that using two separate models to predict registered and casual rentals yields better results than using a single model to predict total rentals (cnt).
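To make the two evaluation routes concrete, here is a minimal sketch with an `mse()` helper (a name introduced here, not part of the analysis code). The actual counts are the first three rows of hour.csv; the predictions are made up purely for illustration and say nothing about how any real model performs.

```r
# Mean squared error between actual values y and predictions yhat.
mse <- function(y, yhat) mean((y - yhat)^2)

# Actual counts for the first three rows of hour.csv.
actual <- data.frame(
  casual     = c(3, 8, 5),
  registered = c(13, 32, 27),
  cnt        = c(16, 40, 32)
)

# Hypothetical predictions, made up purely for illustration.
pred_cnt        <- c(20, 35, 30)   # single model for cnt
pred_casual     <- c(4, 7, 5)      # separate model for casual
pred_registered <- c(14, 30, 28)   # separate model for registered

# Route 1: score the single-model prediction of cnt.
cnt.MSE <- mse(actual$cnt, pred_cnt)

# Route 2: add the two separate predictions, then score against cnt.
combined.MSE <- mse(actual$cnt, pred_casual + pred_registered)

print(c(cnt.MSE = cnt.MSE, combined.MSE = combined.MSE))
```

Note that `mean((y - yhat)^2)` is the same quantity as the `sum(...)/nrow(test)` form used in the code further down.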

The following are the models we applied:

  1. Neural Networks (NN)
  2. Linear Regression
  3. Support Vector Machine (SVM)
  4. Random Forest
  5. Generalized Boosting Model (GBM)
  6. Regression Tree

The results reported for each model are “cnt.MSE”, “combined.MSE”, “registered.MSE” and “casual.MSE”.

# Load packages
library(nnet)
library(ggplot2)
library(ggthemes)
library(gbm)
library(randomForest)
library(e1071)
library(rpart)

First, get the data ready. For Neural Networks we leave the numeric variables as they are, before factorizing some of the attributes for the other models.
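What factorizing changes can be seen on a toy data frame with made-up values. As an integer, `season` enters a model as a single numeric input; as a factor, it expands into one indicator column per non-reference level. The sketch uses base R's `factor()` and `model.matrix()`; which attributes are converted later is not shown in this section, so the choice of `season` here is an assumption.

```r
# Toy data (hypothetical values) with season coded as an integer.
toy <- data.frame(
  season = c(1L, 2L, 3L, 4L, 1L),
  temp   = c(0.24, 0.50, 0.70, 0.30, 0.22)
)

# Numeric coding: season contributes one column.
numeric_design <- model.matrix(~ season + temp, data = toy)

# Factor coding: season contributes one indicator column per non-reference level.
toy_factor <- toy
toy_factor$season <- factor(toy_factor$season)
factor_design <- model.matrix(~ season + temp, data = toy_factor)

print(colnames(numeric_design))
print(colnames(factor_design))
```

With the integer coding the network sees the same 12 input columns that appear in the formula below, which keeps the weight count small.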


(1) Neural Networks

Step 1: Original model

# Original Model
neural.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
neural.model = nnet(neural.formula, train, size=20, maxit=5000, linout=T, decay=0.01)
## # weights:  281
## initial  value 832864009.618527 
## iter  10 value 401885763.140496
## iter  20 value 348978754.378820
## iter  30 value 283838842.601240
## iter  40 value 279704475.218207
## iter  50 value 272687834.520042
## iter  60 value 258199071.170540
## iter  70 value 256275478.213297
## iter  80 value 243343886.713717
## iter  90 value 235322648.202721
## iter 100 value 233331727.993393
## iter 110 value 231727244.639550
## iter 120 value 225599415.914042
## iter 130 value 223289567.934950
## iter 140 value 221943319.078645
## iter 150 value 220557329.620354
## iter 160 value 220016346.942685
## iter 170 value 217213932.679547
## iter 180 value 216560694.258283
## iter 190 value 215558041.452651
## iter 200 value 213938999.014902
## iter 210 value 212820516.089442
## iter 220 value 210136264.007181
## iter 230 value 206102474.622544
## iter 240 value 205758459.787147
## iter 250 value 204431080.291385
## iter 260 value 202766208.932347
## iter 270 value 202076288.588333
## iter 280 value 201093977.567429
## iter 290 value 200667884.474503
## iter 300 value 199951095.739293
## iter 310 value 199426226.617432
## iter 320 value 198973143.088701
## iter 330 value 198289191.954146
## iter 340 value 195195864.541958
## iter 350 value 194522344.330864
## iter 360 value 194233159.991431
## iter 370 value 194020817.358865
## iter 380 value 193744875.825689
## iter 390 value 193613581.064182
## iter 400 value 193507008.131858
## iter 410 value 193212104.989112
## iter 420 value 193146714.611361
## iter 430 value 193041644.596600
## iter 440 value 192419689.254587
## iter 450 value 189402313.985063
## iter 460 value 187464739.089603
## iter 470 value 186850851.856394
## iter 480 value 186426292.245397
## iter 490 value 185347069.865467
## iter 500 value 185095904.084422
## iter 510 value 184918778.173578
## iter 520 value 184598360.161520
## iter 530 value 183801695.239463
## iter 540 value 183260818.813020
## iter 550 value 182766495.644261
## iter 560 value 182102017.387564
## iter 570 value 181748920.541705
## iter 580 value 181170863.952553
## iter 590 value 180574396.914170
## iter 600 value 180443288.568554
## iter 610 value 179357458.281890
## iter 620 value 178895627.321713
## iter 630 value 178608862.929772
## iter 640 value 178314365.405016
## iter 650 value 177373142.084943
## iter 660 value 177114001.979165
## iter 670 value 176539718.654416
## iter 680 value 175449727.670270
## iter 690 value 174994938.239167
## iter 700 value 174281063.378202
## iter 710 value 173231381.795726
## iter 720 value 170926867.407710
## iter 730 value 167644917.758654
## iter 740 value 166173089.236732
## iter 750 value 164797757.685431
## iter 760 value 164269185.172631
## iter 770 value 164174656.338741
## iter 780 value 163776497.285036
## iter 790 value 163186004.360750
## iter 800 value 162577528.947126
## iter 810 value 160953879.121289
## iter 820 value 160080875.285189
## iter 830 value 159070822.746583
## iter 840 value 158619092.588017
## iter 850 value 157722078.940917
## iter 860 value 157082571.804362
## iter 870 value 156410785.343051
## iter 880 value 155822574.044754
## iter 890 value 155619124.659299
## iter 900 value 155010728.721440
## iter 910 value 152345391.169099
## iter 920 value 150658174.793157
## iter 930 value 149832663.008073
## iter 940 value 148800574.883277
## iter 950 value 148448315.657276
## iter 960 value 147992464.911566
## iter 970 value 147746853.460349
## iter 980 value 147456629.392884
## iter 990 value 146857856.251058
## iter1000 value 145985501.219347
## iter1010 value 145417449.570299
## iter1020 value 144951764.465454
## iter1030 value 144750159.816561
## iter1040 value 144478285.585400
## iter1050 value 143674021.339089
## iter1060 value 143238846.447221
## iter1070 value 143009921.090186
## iter1080 value 142712813.649086
## iter1090 value 142613454.724923
## iter1100 value 142506053.820623
## iter1110 value 142433079.316332
## iter1120 value 142287834.081657
## iter1130 value 142071154.609680
## iter1140 value 141410306.816499
## iter1150 value 140829573.739404
## iter1160 value 140503734.565503
## iter1170 value 140361951.887020
## iter1180 value 140360057.755651
## iter1190 value 140356917.507246
## iter1200 value 140352733.973246
## iter1210 value 140349182.850679
## iter1220 value 140342667.931720
## iter1230 value 140315415.971716
## iter1240 value 140298176.299817
## iter1250 value 140269682.941451
## iter1260 value 140257838.802483
## iter1270 value 140241839.011219
## iter1280 value 140236931.028976
## iter1290 value 140233498.892464
## iter1300 value 140217764.052089
## iter1310 value 140209237.797160
## iter1320 value 140206761.340625
## iter1330 value 140198362.231654
## iter1340 value 140193584.109016
## iter1350 value 140187947.647596
## iter1360 value 140145685.010396
## iter1370 value 140078963.497798
## iter1380 value 140007183.508499
## iter1390 value 139962346.332180
## iter1400 value 139848384.007793
## iter1410 value 139793473.225434
## iter1420 value 139775220.188604
## iter1430 value 139733087.589917
## iter1440 value 139668965.210955
## iter1450 value 139538907.473781
## iter1460 value 139263862.836763
## iter1470 value 138913104.915816
## iter1480 value 138551011.261601
## iter1490 value 138201623.150725
## iter1500 value 137961884.771817
## iter1510 value 137423125.143519
## iter1520 value 137053700.173198
## iter1530 value 136792040.824953
## iter1540 value 136647222.812530
## iter1550 value 136457468.851357
## iter1560 value 136214535.404527
## iter1570 value 136105654.805139
## iter1580 value 136028183.824054
## iter1590 value 135941599.254453
## iter1600 value 135875421.337441
## iter1610 value 135843653.936822
## iter1620 value 135823828.205787
## iter1630 value 135755205.957870
## iter1640 value 135677418.608513
## iter1650 value 135556945.397105
## iter1660 value 135525881.861988
## iter1670 value 135479359.102094
## iter1680 value 135442002.268532
## iter1690 value 135404559.271277
## iter1700 value 135355071.531306
## iter1710 value 135306641.138099
## iter1720 value 135282310.716284
## iter1730 value 135260637.637358
## iter1740 value 135244842.780120
## iter1750 value 135203560.791885
## iter1760 value 135140072.245297
## iter1770 value 135101389.721845
## iter1780 value 135060575.892804
## iter1790 value 135019505.899904
## iter1800 value 134953680.490498
## iter1810 value 134925945.372750
## iter1820 value 134907925.709135
## iter1830 value 134898873.100306
## iter1840 value 134890823.455050
## iter1850 value 134867385.077882
## iter1860 value 134858584.495069
## iter1870 value 134849111.911654
## iter1880 value 134833801.007796
## iter1890 value 134802970.485236
## iter1900 value 134769999.492702
## iter1910 value 134662842.557927
## iter1920 value 134503320.037535
## iter1930 value 134352982.749919
## iter1940 value 134206612.273081
## iter1950 value 134043812.008293
## iter1960 value 133906790.278007
## iter1970 value 133770980.579687
## iter1980 value 133614632.261176
## iter1990 value 133424227.116161
## iter2000 value 133036524.485278
## iter2010 value 132672364.743171
## iter2020 value 131999467.059997
## iter2030 value 131033140.520079
## iter2040 value 130136690.386980
## iter2050 value 129328093.615910
## iter2060 value 128319051.198663
## iter2070 value 127531419.265729
## iter2080 value 126809004.747040
## iter2090 value 126273914.228043
## iter2100 value 125719602.555520
## iter2110 value 125235491.070953
## iter2120 value 124755489.351092
## iter2130 value 124019260.511297
## iter2140 value 123237742.484777
## iter2150 value 122565265.949055
## iter2160 value 121167363.842853
## iter2170 value 120135924.294864
## iter2180 value 118187137.757227
## iter2190 value 116795086.625470
## iter2200 value 114860023.305526
## iter2210 value 113715631.227274
## iter2220 value 112541771.605901
## iter2230 value 111881079.403062
## iter2240 value 111519893.653368
## iter2250 value 111105963.134541
## iter2260 value 110751570.141549
## iter2270 value 110384535.772715
## iter2280 value 109846009.217578
## iter2290 value 109222486.931579
## iter2300 value 108311333.334972
## iter2310 value 107668722.563999
## iter2320 value 106916152.291830
## iter2330 value 106268170.934974
## iter2340 value 105869098.495028
## iter2350 value 105724533.851466
## iter2360 value 105574481.157928
## iter2370 value 105326268.424668
## iter2380 value 105105862.460823
## iter2390 value 104841840.277096
## iter2400 value 104589077.083883
## iter2410 value 104412637.676957
## iter2420 value 104312591.577398
## iter2430 value 104242175.112225
## iter2440 value 104132198.727228
## iter2450 value 103987226.806518
## iter2460 value 103871011.222898
## iter2470 value 103667949.486414
## iter2480 value 103496753.046199
## iter2490 value 103249894.947678
## iter2500 value 103050737.378224
## iter2510 value 102818366.676224
## iter2520 value 102694703.071025
## iter2530 value 102598067.715097
## iter2540 value 102498507.633228
## iter2550 value 102416613.677689
## iter2560 value 102205503.967940
## iter2570 value 102077570.284993
## iter2580 value 101943186.663412
## iter2590 value 101808578.900904
## iter2600 value 101730222.697376
## iter2610 value 101686510.673185
## iter2620 value 101623588.182138
## iter2630 value 101509024.603968
## iter2640 value 101383850.517000
## iter2650 value 101252575.331983
## iter2660 value 101144652.555557
## iter2670 value 101050379.841317
## iter2680 value 100970168.238717
## iter2690 value 100816070.250370
## iter2700 value 100592107.623472
## iter2710 value 100362915.142496
## iter2720 value 100119872.997113
## iter2730 value 99829769.591708
## iter2740 value 99524330.520948
## iter2750 value 99285239.454003
## iter2760 value 99084400.044720
## iter2770 value 98631959.708164
## iter2780 value 98172410.544738
## iter2790 value 97620848.769242
## iter2800 value 97019844.224685
## iter2810 value 95936875.927412
## iter2820 value 95162460.817162
## iter2830 value 94684570.487954
## iter2840 value 94341586.395075
## iter2850 value 93547252.642290
## iter2860 value 92732895.879763
## iter2870 value 91150918.202953
## iter2880 value 90082804.362325
## iter2890 value 89348324.133822
## iter2900 value 89230446.258464
## iter2910 value 89149871.002549
## iter2920 value 89063134.829888
## iter2930 value 88863943.024427
## iter2940 value 88699553.381882
## iter2950 value 88534217.423704
## iter2960 value 88348440.896396
## iter2970 value 88246585.033001
## iter2980 value 88105752.822829
## iter2990 value 87951205.743158
## iter3000 value 87679477.656151
## iter3010 value 87520974.331242
## iter3020 value 87353890.652643
## iter3030 value 87174755.462827
## iter3040 value 87018572.871537
## iter3050 value 86790513.662254
## iter3060 value 86609220.107067
## iter3070 value 86480541.821860
## iter3080 value 86405319.708296
## iter3090 value 86346111.640003
## iter3100 value 86245547.663425
## iter3110 value 86083669.717867
## iter3120 value 85913068.788343
## iter3130 value 85719633.334796
## iter3140 value 85501589.411787
## iter3150 value 85056085.719742
## iter3160 value 84690018.448498
## iter3170 value 84358168.792454
## iter3180 value 83695654.799645
## iter3190 value 82201463.624141
## iter3200 value 80210449.256535
## iter3210 value 77281453.342227
## iter3220 value 71940969.624103
## iter3230 value 66862741.391722
## iter3240 value 58952532.971234
## iter3250 value 52403859.119344
## iter3260 value 50392785.057772
## iter3270 value 49205386.139975
## iter3280 value 48119962.682653
## iter3290 value 48014920.363831
## iter3300 value 47801842.214646
## iter3310 value 47627682.296674
## iter3320 value 47334459.417899
## iter3330 value 46860269.488157
## iter3340 value 46671526.420674
## iter3350 value 46570409.816785
## iter3360 value 46509335.792177
## iter3370 value 46412459.710476
## iter3380 value 46277453.749293
## iter3390 value 46166874.600040
## iter3400 value 45908182.440148
## iter3410 value 45666669.463691
## iter3420 value 45448015.224621
## iter3430 value 45189433.915287
## iter3440 value 44812108.012661
## iter3450 value 44491263.590546
## iter3460 value 44126947.140265
## iter3470 value 43785227.838795
## iter3480 value 43600527.571279
## iter3490 value 43304554.045295
## iter3500 value 43207150.360274
## iter3510 value 43123426.257024
## iter3520 value 43049502.115530
## iter3530 value 42967912.894376
## iter3540 value 42916106.809391
## iter3550 value 42873876.648944
## iter3560 value 42833648.807118
## iter3570 value 42782826.741581
## iter3580 value 42752689.351083
## iter3590 value 42707623.227127
## iter3600 value 42671474.813470
## iter3610 value 42629434.751530
## iter3620 value 42583094.180221
## iter3630 value 42535937.007251
## iter3640 value 42425747.447780
## iter3650 value 42320001.203367
## iter3660 value 42157460.466764
## iter3670 value 42071330.305366
## iter3680 value 42003015.985785
## iter3690 value 41934324.690591
## iter3700 value 41857025.685952
## iter3710 value 41744704.357937
## iter3720 value 41634807.993501
## iter3730 value 41521342.564599
## iter3740 value 41300779.286309
## iter3750 value 41186445.779121
## iter3760 value 41157086.537967
## iter3770 value 41117390.511119
## iter3780 value 41091381.793835
## iter3790 value 41051336.890762
## iter3800 value 41007579.037139
## iter3810 value 40968545.291314
## iter3820 value 40936319.898544
## iter3830 value 40920824.105293
## iter3840 value 40908722.948887
## iter3850 value 40898366.088890
## iter3860 value 40886624.607109
## iter3870 value 40869096.247405
## iter3880 value 40847176.066384
## iter3890 value 40832919.569154
## iter3900 value 40798412.783543
## iter3910 value 40760536.825810
## iter3920 value 40734547.308361
## iter3930 value 40716583.193658
## iter3940 value 40699799.105852
## iter3950 value 40688901.488405
## iter3960 value 40668574.772251
## iter3970 value 40664249.828013
## iter3980 value 40661378.940314
## iter3990 value 40658719.501348
## iter4000 value 40651929.515931
## iter4010 value 40643137.825240
## iter4020 value 40639239.982397
## iter4030 value 40622285.060594
## iter4040 value 40607816.289442
## iter4050 value 40563976.850564
## iter4060 value 40549022.548591
## iter4070 value 40534337.360902
## iter4080 value 40523953.217672
## iter4090 value 40511976.838580
## iter4100 value 40438773.129380
## iter4110 value 40388547.259580
## iter4120 value 40258235.678470
## iter4130 value 40104017.909276
## iter4140 value 40078092.409546
## iter4150 value 40042767.465935
## iter4160 value 40018378.825907
## iter4170 value 40006039.095887
## iter4180 value 39994625.175644
## iter4190 value 39978793.361146
## iter4200 value 39953685.462221
## iter4210 value 39923401.221300
## iter4220 value 39818158.189066
## iter4230 value 39731022.720401
## iter4240 value 39600079.654829
## iter4250 value 39418376.500320
## iter4260 value 39257446.058654
## iter4270 value 39101129.105406
## iter4280 value 38965174.792821
## iter4290 value 38826225.665225
## iter4300 value 38717336.243305
## iter4310 value 38619659.093649
## iter4320 value 38502651.215678
## iter4330 value 38377417.931270
## iter4340 value 38275532.777715
## iter4350 value 38140431.616971
## iter4360 value 37887902.065785
## iter4370 value 37559784.441424
## iter4380 value 37551056.347525
## iter4390 value 37537750.984021
## iter4400 value 37502171.909502
## iter4410 value 37443161.258541
## iter4420 value 37407000.099015
## iter4430 value 37381945.533374
## iter4440 value 37334221.344278
## iter4450 value 37289378.655420
## iter4460 value 37238552.656282
## iter4470 value 37215737.299979
## iter4480 value 37169730.613472
## iter4490 value 37108936.069496
## iter4500 value 37037177.987864
## iter4510 value 36984734.225380
## iter4520 value 36943032.875736
## iter4530 value 36835552.019373
## iter4540 value 36724094.132659
## iter4550 value 36643474.349354
## iter4560 value 36555214.887095
## iter4570 value 36494923.451563
## iter4580 value 36430144.620388
## iter4590 value 36373303.451998
## iter4600 value 36292317.709649
## iter4610 value 36226401.087591
## iter4620 value 36190665.010978
## iter4630 value 36135056.222518
## iter4640 value 36006938.096010
## iter4650 value 35945019.242678
## iter4660 value 35917414.343570
## iter4670 value 35899944.134005
## iter4680 value 35878392.391346
## iter4690 value 35851089.828150
## iter4700 value 35780496.629362
## iter4710 value 35679583.168144
## iter4720 value 35621059.530129
## iter4730 value 35575247.092409
## iter4740 value 35574064.435525
## iter4750 value 35572406.201197
## iter4760 value 35570825.181430
## iter4770 value 35567228.064440
## iter4780 value 35561470.239392
## iter4790 value 35558385.035623
## iter4800 value 35556643.138034
## iter4810 value 35552299.825962
## iter4820 value 35550679.870202
## iter4830 value 35549224.355189
## iter4840 value 35548005.242872
## iter4850 value 35546428.143204
## iter4860 value 35545817.598336
## iter4870 value 35542856.714897
## iter4880 value 35540119.317696
## iter4890 value 35537056.283580
## iter4900 value 35535371.889754
## iter4910 value 35533808.109805
## iter4920 value 35531825.289901
## iter4930 value 35529904.915196
## iter4940 value 35528203.923068
## iter4950 value 35527718.446946
## iter4960 value 35527101.438905
## iter4970 value 35526559.665075
## iter4980 value 35526074.399559
## iter4990 value 35525452.372632
## iter5000 value 35524796.123247
## final  value 35524796.123247 
## stopped after 5000 iterations
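The `# weights:  281` line at the top of the log can be checked by hand. Assuming a single hidden layer with bias units and no skip-layer connections (the `nnet()` defaults, which the call above does not override), a network with p inputs, h hidden units and one output has (p + 1) * h + (h + 1) weights. The helper name `n_weights` is made up for this sketch.

```r
# Number of weights in a single-hidden-layer network with bias units
# and no skip-layer connections.
n_weights <- function(n_inputs, n_hidden, n_outputs = 1) {
  (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs
}

# 12 predictors in neural.formula, size = 20 hidden units.
w_cnt <- n_weights(12, 20)

# 10 predictors, as in the registered model of Step 2.
w_registered <- n_weights(10, 20)

print(c(cnt = w_cnt, registered = w_registered))
```

These give 281 and 241, matching the counts printed for the two models. The log also ends with "stopped after 5000 iterations", which means the optimizer reached `maxit` rather than meeting its convergence criterion, so the fitted weights may still be improving at that point.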
testset=subset(test, select = c("season", "yr" , "mnth" , "hr" , "holiday" , "weekday" , "workingday" , "weathersit" , "temp" , "atemp" , "hum" , "windspeed"))

neural.cnt = predict(neural.model, testset, type="raw")

test$neural.cnt=neural.cnt
# Compute MSE
neural.MSE = sum((test$cnt - test$neural.cnt)^2)/nrow(test)
neural.MSE
## [1] 4692.282
# Plot to check result
neural.result = ggplot(test,aes(cnt,neural.cnt))+geom_point()
neural.result

# Set negative predictions to zero (counts cannot be negative)
test$neural.cnt[test$neural.cnt < 0] = 0
# Compute new MSE
neural.MSE = sum((test$cnt - test$neural.cnt)^2)/nrow(test)
neural.MSE
## [1] 4518.081
neural.result = ggplot(test,aes(cnt,neural.cnt))+geom_point()
neural.result
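Setting negative predictions to zero lowers the MSE because rental counts can never be negative, so moving a negative prediction up to zero always moves it closer to the actual value. A small sketch with hypothetical numbers, using `pmax()` as an equivalent to the indexed assignment above:

```r
# Mean squared error between actual values y and predictions yhat.
mse <- function(y, yhat) mean((y - yhat)^2)

actual <- c(1, 0, 2, 16)        # counts are never negative
pred   <- c(-12.5, -3, 4, 15)   # hypothetical raw network outputs

# Replace every negative prediction with zero; other values are unchanged.
clamped <- pmax(pred, 0)

mse_raw     <- mse(actual, pred)
mse_clamped <- mse(actual, clamped)
print(c(raw = mse_raw, clamped = mse_clamped))
```

The size of the improvement depends on how many predictions are negative and by how much; in the results above the MSE drops from 4692.282 to 4518.081.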

Step 2: Separate models for registered and casual

# Model for Registered
# Dropping windspeed and holiday yields the best result
neural.registered.formula = registered~season + yr + mnth + hr + weekday + workingday+ weathersit + temp + atemp + hum
neural.model.registered = nnet(neural.registered.formula, train, size=20, maxit=5000, linout=T, decay=0.01)
## # weights:  241
## initial  value 568686130.343306 
## iter  10 value 224799565.275934
## iter  20 value 194194898.396173
## iter  30 value 170533844.130250
## iter  40 value 160711617.275227
## iter  50 value 150633279.264054
## iter  60 value 140505887.758522
## iter  70 value 136342363.991998
## iter  80 value 131076423.221615
## iter  90 value 128382963.954093
## iter 100 value 127001110.380051
## iter 110 value 125358475.051638
## iter 120 value 123280037.651347
## iter 130 value 122322042.406086
## iter 140 value 121103625.573018
## iter 150 value 120269936.052892
## iter 160 value 119399754.438662
## iter 170 value 118821544.045348
## iter 180 value 118428125.019129
## iter 190 value 117793714.695103
## iter 200 value 117421597.430252
## iter 210 value 117141899.944874
## iter 220 value 116910419.527180
## iter 230 value 116546901.790359
## iter 240 value 116146375.033531
## iter 250 value 115837372.818173
## iter 260 value 115456621.413242
## iter 270 value 115239755.805721
## iter 280 value 114983999.130343
## iter 290 value 114787333.262960
## iter 300 value 114626610.339350
## iter 310 value 114339074.643109
## iter 320 value 114121281.525322
## iter 330 value 113942955.121875
## iter 340 value 113810679.790881
## iter 350 value 113636271.304112
## iter 360 value 113471359.104317
## iter 370 value 113324540.892058
## iter 380 value 113122892.739943
## iter 390 value 112932534.924182
## iter 400 value 112627641.244006
## iter 410 value 112222611.951328
## iter 420 value 111601706.485061
## iter 430 value 111059477.486920
## iter 440 value 104896513.362104
## iter 450 value 101444833.924661
## iter 460 value 98430853.467490
## iter 470 value 94783485.081091
## iter 480 value 92638794.426826
## iter 490 value 91380898.080639
## iter 500 value 90060759.025863
## iter 510 value 88188852.794439
## iter 520 value 86850753.270327
## iter 530 value 85669171.070149
## iter 540 value 84137775.599222
## iter 550 value 83065939.632805
## iter 560 value 82283239.421975
## iter 570 value 81264892.792213
## iter 580 value 80454428.471694
## iter 590 value 79283640.916300
## iter 600 value 78574142.832668
## iter 610 value 77720391.728255
## iter 620 value 76075589.932280
## iter 630 value 75710107.270133
## iter 640 value 74240170.429274
## iter 650 value 72741708.095957
## iter 660 value 71800505.303411
## iter 670 value 70805749.124546
## iter 680 value 70269177.195443
## iter 690 value 69862613.501493
## iter 700 value 69453248.539112
## iter 710 value 68937798.194501
## iter 720 value 68489212.621475
## iter 730 value 67956099.667945
## iter 740 value 67452545.438038
## iter 750 value 67044563.212355
## iter 760 value 66646259.402140
## iter 770 value 65990851.575152
## iter 780 value 64817028.772469
## iter 790 value 63713789.367602
## iter 800 value 62513688.676355
## iter 810 value 60760940.100978
## iter 820 value 59913282.830042
## iter 830 value 59272869.686933
## iter 840 value 58632273.005806
## iter 850 value 57510583.444708
## iter 860 value 56145554.925658
## iter 870 value 55434067.024778
## iter 880 value 54838616.285319
## iter 890 value 54323641.422294
## iter 900 value 53624080.549460
## iter 910 value 52956707.346642
## iter 920 value 52635372.364303
## iter 930 value 51768228.681180
## iter 940 value 49991741.172698
## iter 950 value 48431040.497803
## iter 960 value 47251983.361385
## iter 970 value 46406362.416056
## iter 980 value 45255677.910001
## iter 990 value 44396462.640040
## iter1000 value 43453833.500133
## iter1010 value 42246755.438191
## iter1020 value 41300827.155338
## iter1030 value 40181828.059796
## iter1040 value 38802226.846951
## iter1050 value 36659224.436161
## iter1060 value 35835862.424168
## iter1070 value 35126306.773055
## iter1080 value 34744471.938399
## iter1090 value 34409748.697047
## iter1100 value 34190840.598433
## iter1110 value 34046551.933988
## iter1120 value 33947549.198783
## iter1130 value 33813822.213267
## iter1140 value 33527859.164088
## iter1150 value 33442713.460972
## iter1160 value 33383658.102691
## iter1170 value 33106531.463549
## iter1180 value 33013919.252271
## iter1190 value 32917312.572135
## iter1200 value 32780783.952370
## iter1210 value 32672947.055460
## iter1220 value 32287881.669201
## iter1230 value 31833471.819893
## iter1240 value 31563191.286803
## iter1250 value 31424973.858167
## iter1260 value 31325370.189411
## iter1270 value 31226935.838929
## iter1280 value 31100518.507047
## iter1290 value 30968224.901222
## iter1300 value 30835138.017051
## iter1310 value 30782778.541381
## iter1320 value 30662402.074133
## iter1330 value 30599500.615899
## iter1340 value 30431708.636013
## iter1350 value 30085894.123280
## iter1360 value 29847754.132554
## iter1370 value 29550537.749332
## iter1380 value 29308610.874330
## iter1390 value 29102286.353159
## iter1400 value 28860305.526220
## iter1410 value 28510961.921467
## iter1420 value 28057774.401506
## iter1430 value 27740677.778889
## iter1440 value 27654613.382901
## iter1450 value 27591030.150826
## iter1460 value 27536666.609650
## iter1470 value 27466681.927161
## iter1480 value 27309354.697418
## iter1490 value 27149993.676055
## iter1500 value 27040181.734499
## iter1510 value 26977654.117402
## iter1520 value 26925982.303895
## iter1530 value 26863930.046467
## iter1540 value 26760689.780486
## iter1550 value 26626133.950351
## iter1560 value 26420247.688411
## iter1570 value 26075580.375922
## iter1580 value 25822384.225444
## iter1590 value 25601607.781477
## iter1600 value 25368897.272406
## iter1610 value 25170490.625616
## iter1620 value 25028139.287094
## iter1630 value 24825284.890338
## iter1640 value 24577194.480514
## iter1650 value 24346081.469685
## iter1660 value 24108864.919947
## iter1670 value 23888273.461398
## iter1680 value 23764230.558959
## iter1690 value 23636674.595007
## iter1700 value 23563332.657414
## iter1710 value 23511281.168031
## iter1720 value 23467058.758722
## iter1730 value 23433487.607479
## iter1740 value 23398064.530300
## iter1750 value 23366201.700543
## iter1760 value 23348061.092812
## iter1770 value 23331966.574344
## iter1780 value 23316625.199330
## iter1790 value 23283045.445964
## iter1800 value 23212762.839882
## iter1810 value 23199601.740732
## iter1820 value 23149277.537599
## iter1830 value 23031535.817209
## iter1840 value 22986124.312328
## iter1850 value 22965050.407372
## iter1860 value 22964166.868582
## iter1870 value 22963342.532504
## iter1880 value 22962256.595977
## iter1890 value 22959389.144952
## iter1900 value 22953435.618915
## iter1910 value 22945958.213617
## iter1920 value 22938575.605686
## iter1930 value 22935894.883595
## iter1940 value 22933076.146926
## iter1950 value 22931156.101491
## iter1960 value 22930119.979525
## iter1970 value 22928338.517067
## iter1980 value 22926169.610272
## iter1990 value 22920672.288320
## iter2000 value 22917352.162668
## iter2010 value 22915392.571248
## iter2020 value 22912890.518980
## iter2030 value 22910105.943298
## iter2040 value 22906684.916300
## iter2050 value 22902058.785450
## iter2060 value 22894934.741419
## iter2070 value 22894769.390257
## iter2080 value 22894017.619083
## iter2090 value 22893645.995594
## iter2100 value 22893158.792575
## iter2110 value 22892474.029236
## iter2120 value 22891293.912287
## iter2130 value 22890112.150415
## iter2140 value 22888728.224902
## iter2150 value 22885886.694130
## iter2160 value 22880879.411050
## iter2170 value 22874244.520433
## iter2180 value 22871111.912893
## iter2190 value 22868273.755771
## iter2200 value 22864857.553996
## iter2210 value 22854868.178776
## iter2220 value 22841422.976174
## iter2230 value 22828038.167142
## iter2240 value 22813608.164669
## iter2250 value 22794518.282414
## iter2260 value 22781650.038037
## iter2270 value 22773062.480465
## iter2280 value 22765433.936657
## iter2290 value 22754363.262957
## iter2300 value 22740885.580761
## iter2310 value 22661988.841929
## iter2320 value 22439888.524533
## iter2330 value 22392432.663666
## iter2340 value 22374906.752904
## iter2350 value 22363697.201925
## iter2360 value 22347027.524570
## iter2370 value 22335636.808002
## iter2380 value 22329324.723144
## iter2390 value 22324590.120690
## iter2400 value 22322884.856659
## iter2410 value 22322029.399783
## iter2420 value 22321771.755151
## iter2430 value 22321563.553685
## iter2440 value 22321520.333225
## final  value 22321518.376554 
## converged
neural.registered = predict(neural.model.registered, testset, type="raw")
test$neural.registered = neural.registered
neural.registered.MSE = sum((test$neural.registered-test$registered)^2)/nrow(test)
neural.registered.MSE
## [1] 3532.43
neural.registered.result = ggplot(test,aes(registered,neural.registered))+geom_point()
neural.registered.result

test$neural.registered[test$neural.registered < 0] = 0
neural.registered.MSE = sum((test$neural.registered-test$registered)^2)/nrow(test)
neural.registered.MSE
## [1] 3498.999
neural.registered.result = ggplot(test,aes(registered,neural.registered))+geom_point()
neural.registered.result

# Model for Casual
neural.casual.formula = casual~season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
neural.model.casual = nnet(neural.casual.formula, train, size=20, maxit=5000, linout=T, decay=0.01)
## # weights:  281
## initial  value 45214417.527726 
## iter  10 value 25783493.756270
## iter  20 value 22962102.714545
## iter  30 value 16358974.805088
## iter  40 value 15483419.331333
## iter  50 value 15078790.652825
## iter  60 value 13779035.637621
## iter  70 value 13305914.648520
## iter  80 value 12934003.414897
## iter  90 value 12800985.354611
## iter 100 value 12509742.874950
## iter 110 value 12306701.076024
## iter 120 value 12131452.035006
## iter 130 value 12053021.648556
## iter 140 value 11983109.590623
## iter 150 value 11887118.858882
## iter 160 value 11727088.933557
## iter 170 value 11285499.685422
## iter 180 value 10699923.927553
## iter 190 value 10267727.062433
## iter 200 value 10006026.126844
## iter 210 value 9843368.844550
## iter 220 value 9738816.580881
## iter 230 value 9540334.043399
## iter 240 value 9300095.902442
## iter 250 value 9088914.439828
## iter 260 value 8749717.702616
## iter 270 value 8548689.717004
## iter 280 value 8361239.749458
## iter 290 value 8162263.094508
## iter 300 value 8074811.202164
## iter 310 value 7964353.603733
## iter 320 value 7831518.483539
## iter 330 value 7733002.923687
## iter 340 value 7646911.817943
## iter 350 value 7580150.018505
## iter 360 value 7509823.435954
## iter 370 value 7361689.774464
## iter 380 value 7252306.864039
## iter 390 value 7153840.595112
## iter 400 value 7082238.262624
## iter 410 value 6939310.737458
## iter 420 value 6869245.509823
## iter 430 value 6826925.870942
## iter 440 value 6798973.319214
## iter 450 value 6777256.435256
## iter 460 value 6748886.392163
## iter 470 value 6734499.785346
## iter 480 value 6716341.605907
## iter 490 value 6673480.641991
## iter 500 value 6653973.033608
## iter 510 value 6557841.210427
## iter 520 value 6447766.584268
## iter 530 value 6357816.035876
## iter 540 value 6224701.276552
## iter 550 value 6107851.195948
## iter 560 value 5954178.763245
## iter 570 value 5760484.971669
## iter 580 value 5597390.524754
## iter 590 value 5519686.792756
## iter 600 value 5486457.395309
## iter 610 value 5453282.747432
## iter 620 value 5426984.439661
## iter 630 value 5398876.638860
## iter 640 value 5355504.034493
## iter 650 value 5308783.030668
## iter 660 value 5263009.152591
## iter 670 value 5229696.531953
## iter 680 value 5194169.521916
## iter 690 value 5146692.401054
## iter 700 value 5107869.123862
## iter 710 value 5075166.544935
## iter 720 value 5026504.098210
## iter 730 value 4986398.077278
## iter 740 value 4951505.847573
## iter 750 value 4912434.775103
## iter 760 value 4859121.101555
## iter 770 value 4843000.917812
## iter 780 value 4830431.970243
## iter 790 value 4819516.988498
## iter 800 value 4806590.632776
## iter 810 value 4787551.294771
## iter 820 value 4756784.775086
## iter 830 value 4718180.548716
## iter 840 value 4695069.267871
## iter 850 value 4659321.794708
## iter 860 value 4626455.838325
## iter 870 value 4603051.210041
## iter 880 value 4584934.027218
## iter 890 value 4566991.703409
## iter 900 value 4553145.249418
## iter 910 value 4551541.089303
## iter 920 value 4547737.475402
## iter 930 value 4544464.295164
## iter 940 value 4540956.630788
## iter 950 value 4534661.961958
## iter 960 value 4524831.950040
## iter 970 value 4516153.088589
## iter 980 value 4512660.261463
## iter 990 value 4506975.902519
## iter1000 value 4495678.399521
## iter1010 value 4484593.490755
## iter1020 value 4467433.439451
## iter1030 value 4444205.356998
## iter1040 value 4431006.906468
## iter1050 value 4410384.732804
## iter1060 value 4379567.727729
## iter1070 value 4341808.360800
## iter1080 value 4314928.252094
## iter1090 value 4294790.639326
## iter1100 value 4274064.028811
## iter1110 value 4242902.439389
## iter1120 value 4217833.959182
## iter1130 value 4190896.530046
## iter1140 value 4174981.323867
## iter1150 value 4169472.795145
## iter1160 value 4166575.582223
## iter1170 value 4162707.677102
## iter1180 value 4159334.656227
## iter1190 value 4152492.844283
## iter1200 value 4144540.884064
## iter1210 value 4136460.475406
## iter1220 value 4126135.160457
## iter1230 value 4122407.059942
## iter1240 value 4118787.703031
## iter1250 value 4111762.846614
## iter1260 value 4103697.203238
## iter1270 value 4088206.017027
## iter1280 value 4075833.265829
## iter1290 value 4066430.774257
## iter1300 value 4060506.064070
## iter1310 value 4052736.851262
## iter1320 value 4041811.217628
## iter1330 value 4031425.210711
## iter1340 value 4026749.903934
## iter1350 value 4025498.341836
## iter1360 value 4023869.508571
## iter1370 value 4021404.512217
## iter1380 value 4019915.782875
## iter1390 value 4018442.713556
## iter1400 value 4017234.919918
## iter1410 value 4016262.089610
## iter1420 value 4014921.324723
## iter1430 value 4014101.128800
## iter1440 value 4012600.162725
## iter1450 value 4010529.341469
## iter1460 value 4008086.602237
## iter1470 value 4004651.007804
## iter1480 value 4000344.429574
## iter1490 value 3994683.163734
## iter1500 value 3990199.690352
## iter1510 value 3984001.826169
## iter1520 value 3976405.268943
## iter1530 value 3967416.023342
## iter1540 value 3956987.962405
## iter1550 value 3947505.293533
## iter1560 value 3936936.614851
## iter1570 value 3931312.705042
## iter1580 value 3931097.277075
## iter1590 value 3930904.960513
## iter1600 value 3929773.584138
## iter1610 value 3928536.534401
## iter1620 value 3928218.698989
## iter1630 value 3927814.538134
## iter1640 value 3927402.352442
## iter1650 value 3926800.333832
## iter1660 value 3926180.964609
## iter1670 value 3925729.976555
## iter1680 value 3925143.085217
## iter1690 value 3924221.980571
## iter1700 value 3921462.189236
## iter1710 value 3917818.409777
## iter1720 value 3915092.724139
## iter1730 value 3911410.525292
## iter1740 value 3906676.938087
## iter1750 value 3901631.521972
## iter1760 value 3896365.526685
## iter1770 value 3891677.778219
## iter1780 value 3888163.829274
## iter1790 value 3884195.442252
## iter1800 value 3877544.108921
## iter1810 value 3872231.960313
## iter1820 value 3864672.353021
## iter1830 value 3859392.627049
## iter1840 value 3854192.220220
## iter1850 value 3846746.805129
## iter1860 value 3836808.198884
## iter1870 value 3820532.696806
## iter1880 value 3790393.541514
## iter1890 value 3752556.392863
## iter1900 value 3735919.175273
## iter1910 value 3726578.753731
## iter1920 value 3715237.782154
## iter1930 value 3704886.348018
## iter1940 value 3699151.815490
## iter1950 value 3692996.103018
## iter1960 value 3687724.276450
## iter1970 value 3675444.387122
## iter1980 value 3649861.598918
## iter1990 value 3631135.088745
## iter2000 value 3611972.056292
## iter2010 value 3593527.890049
## iter2020 value 3580408.562802
## iter2030 value 3575465.123942
## iter2040 value 3572211.026915
## iter2050 value 3564653.738449
## iter2060 value 3559072.499292
## iter2070 value 3555467.170889
## iter2080 value 3543808.758279
## iter2090 value 3501668.706058
## iter2100 value 3465438.454287
## iter2110 value 3440142.268250
## iter2120 value 3424005.108467
## iter2130 value 3395708.635432
## iter2140 value 3391161.937708
## iter2150 value 3390046.355434
## iter2160 value 3388617.337674
## iter2170 value 3386548.019477
## iter2180 value 3384379.152262
## iter2190 value 3381495.259089
## iter2200 value 3378097.545772
## iter2210 value 3376838.747234
## iter2220 value 3376024.073411
## iter2230 value 3374962.204020
## iter2240 value 3373413.345884
## iter2250 value 3371340.000523
## iter2260 value 3369582.691673
## iter2270 value 3368525.743365
## iter2280 value 3366721.780358
## iter2290 value 3362663.239400
## iter2300 value 3355778.371249
## iter2310 value 3346688.722351
## iter2320 value 3335279.454434
## iter2330 value 3326113.393524
## iter2340 value 3313314.449856
## iter2350 value 3303009.193281
## iter2360 value 3288566.708618
## iter2370 value 3271329.516374
## iter2380 value 3257206.944694
## iter2390 value 3255627.379885
## iter2400 value 3254923.894913
## iter2410 value 3253676.368964
## iter2420 value 3252495.312728
## iter2430 value 3251105.711448
## iter2440 value 3249411.129474
## iter2450 value 3248240.346191
## iter2460 value 3247380.881804
## iter2470 value 3246288.725664
## iter2480 value 3245476.066953
## iter2490 value 3244954.501926
## iter2500 value 3243967.544717
## iter2510 value 3243315.543059
## iter2520 value 3242517.337090
## iter2530 value 3240843.684409
## iter2540 value 3239110.654068
## iter2550 value 3235861.000880
## iter2560 value 3231090.358974
## iter2570 value 3226330.246352
## iter2580 value 3220304.966729
## iter2590 value 3209694.249868
## iter2600 value 3199657.704166
## iter2610 value 3183759.646358
## iter2620 value 3169101.262506
## iter2630 value 3160570.555632
## iter2640 value 3152608.308889
## iter2650 value 3144894.380890
## iter2660 value 3136228.238317
## iter2670 value 3131131.825520
## iter2680 value 3126564.176290
## iter2690 value 3117391.884958
## iter2700 value 3109175.896790
## iter2710 value 3104306.272823
## iter2720 value 3096846.620565
## iter2730 value 3086922.804714
## iter2740 value 3079400.409189
## iter2750 value 3070620.436893
## iter2760 value 3069409.069562
## iter2770 value 3067639.976095
## iter2780 value 3066989.574984
## iter2790 value 3066492.365718
## iter2800 value 3065834.139545
## iter2810 value 3065010.212070
## iter2820 value 3064583.958528
## iter2830 value 3064111.180737
## iter2840 value 3063459.960425
## iter2850 value 3062831.944474
## iter2860 value 3062266.900539
## iter2870 value 3061959.231378
## iter2880 value 3061769.590421
## iter2890 value 3061522.140490
## iter2900 value 3060968.867623
## iter2910 value 3060266.395464
## iter2920 value 3059146.094958
## iter2930 value 3058394.631878
## iter2940 value 3057064.514652
## iter2950 value 3056082.776272
## iter2960 value 3054805.188309
## iter2970 value 3053942.398245
## iter2980 value 3053325.214000
## iter2990 value 3052282.682880
## iter3000 value 3051231.078722
## iter3010 value 3049978.966722
## iter3020 value 3048185.089731
## iter3030 value 3046359.097667
## iter3040 value 3044846.632130
## iter3050 value 3040735.497704
## iter3060 value 3036674.778528
## iter3070 value 3035085.277855
## iter3080 value 3034462.855518
## iter3090 value 3034265.348796
## iter3100 value 3034247.820748
## iter3110 value 3034219.139840
## iter3120 value 3034179.503822
## iter3130 value 3034108.742281
## iter3140 value 3034006.475035
## iter3150 value 3033784.475263
## iter3160 value 3033470.586892
## iter3170 value 3033258.235492
## iter3180 value 3033167.790024
## iter3190 value 3033060.565063
## iter3200 value 3032952.185571
## iter3210 value 3032680.603648
## iter3220 value 3032320.979293
## iter3230 value 3032007.831483
## iter3240 value 3031557.109973
## iter3250 value 3030606.652430
## iter3260 value 3028644.102134
## iter3270 value 3027256.669538
## iter3280 value 3026026.905622
## iter3290 value 3024176.581816
## iter3300 value 3023129.200853
## iter3310 value 3022616.041283
## iter3320 value 3022398.285129
## iter3330 value 3022219.452207
## iter3340 value 3022032.601744
## iter3350 value 3021928.250158
## iter3360 value 3021832.199856
## iter3370 value 3021740.194871
## iter3380 value 3021679.640545
## iter3390 value 3021663.906033
## iter3400 value 3021662.449575
## iter3410 value 3021660.640546
## iter3420 value 3021655.153558
## iter3430 value 3021646.536963
## final  value 3021645.984970 
## converged
neural.casual = predict(neural.model.casual, testset, type="raw")
test$neural.casual = neural.casual
neural.casual.MSE = sum((test$neural.casual-test$casual)^2)/nrow(test)
neural.casual.MSE
## [1] 2323.752
neural.casual.result = ggplot(test,aes(casual,neural.casual))+geom_point()
neural.casual.result

test$neural.casual[test$neural.casual < 0] = 0
neural.casual.MSE = sum((test$neural.casual-test$casual)^2)/nrow(test)
neural.casual.MSE
## [1] 2302.934
neural.casual.result = ggplot(test,aes(casual,neural.casual))+geom_point()
neural.casual.result
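
The two nnet() calls above print one line every ten iterations, which is where the long optimisation logs come from. The following is a minimal sketch, assuming the nnet package is installed; the toy data frame and the names toy, x, y and fit.toy are made up for illustration and are not part of the analysis. It suggests that trace = FALSE suppresses the log while the fitted object still records whether the optimiser stopped before maxit.

```r
library(nnet)

set.seed(42)  # nnet starts from random weights, so fix the seed for repeatability

# Toy regression data (illustrative only, not the bike sharing data)
toy <- data.frame(x = runif(200))
toy$y <- 3 * toy$x + rnorm(200, sd = 0.1)

# Same kind of call as above, with the iteration log switched off
fit.toy <- nnet(y ~ x, data = toy, size = 3, maxit = 500,
                linout = TRUE, decay = 0.01, trace = FALSE)

# 0 means the optimiser stopped before maxit, 1 means maxit was reached
fit.toy$convergence

# Predictions come back as a one-column matrix, as in the models above
pred.toy <- predict(fit.toy, toy, type = "raw")
mean((toy$y - pred.toy)^2)
```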

Step3: Now combine the predicted registered users and casual users

test$neural.combined = test$neural.casual + test$neural.registered
neural.combined.MSE = sum((test$cnt-test$neural.combined)^2)/nrow(test)
neural.combined.MSE
## [1] 7194.954
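
Every model in this report is scored the same way: mean squared error on the test set, first on the raw predictions and then again after negative predictions are set to 0. As a rough sketch only, that pattern could be wrapped in a small helper. The function name mse and its clamp argument are hypothetical and not part of the original code, and the vectors are toy values.

```r
# Hypothetical helper: mean squared error, optionally setting negative
# predictions to zero first (counts of rentals cannot be negative)
mse <- function(actual, predicted, clamp = FALSE) {
  if (clamp) predicted <- pmax(predicted, 0)
  mean((actual - predicted)^2)
}

# Toy vectors for illustration only
actual    <- c(0, 5, 10)
predicted <- c(-2, 5, 13)

mse(actual, predicted)                # (4 + 0 + 9) / 3
mse(actual, predicted, clamp = TRUE)  # (0 + 0 + 9) / 3
```

With such a helper, a line like sum((test$cnt-test$neural.combined)^2)/nrow(test) would read mse(test$cnt, test$neural.combined).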

Convert the categorical variables to factors for the remaining models

# Factorization of training data
train$season <- factor(train$season)
train$yr <- factor(train$yr)
train$mnth <- factor(train$mnth)
train$hr <- factor(train$hr)
train$holiday <- factor(train$holiday)
train$weekday<- factor(train$weekday)
train$workingday <- factor(train$workingday)
train$weathersit <- factor(train$weathersit)

# Factorization of test data
test$season <- factor(test$season)
test$yr <- factor(test$yr)
test$mnth <- factor(test$mnth)
test$hr <- factor(test$hr)
test$holiday <- factor(test$holiday)
test$weekday<- factor(test$weekday)
test$workingday <- factor(test$workingday)
test$weathersit <- factor(test$weathersit)
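
The sixteen assignments above can also be written as a loop over column names. The sketch below uses two small made-up data frames (train.toy and test.toy, both hypothetical) instead of the real train and test sets. It also reuses the training levels for the test columns, which is an extra precaution not taken in the original code: it keeps both frames on one encoding even if a level is absent from the test split.

```r
# Toy frames standing in for train and test (illustrative values only)
train.toy <- data.frame(season = c(1, 2, 3, 4), hr = c(0, 1, 2, 3),
                        temp = c(0.2, 0.4, 0.6, 0.8))
test.toy  <- data.frame(season = c(2, 4), hr = c(1, 3),
                        temp = c(0.5, 0.7))

# Columns to treat as categorical; the real list would be season, yr, mnth,
# hr, holiday, weekday, workingday and weathersit
factor.cols <- c("season", "hr")

# Convert the training columns to factors
train.toy[factor.cols] <- lapply(train.toy[factor.cols], factor)

# Convert the test columns using the levels seen in training
for (col in factor.cols) {
  test.toy[[col]] <- factor(test.toy[[col]], levels = levels(train.toy[[col]]))
}

str(test.toy)
```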

(2) Linear Regression

Step1: Original model

# Original Model
lm.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + 
    workingday + weathersit + temp + atemp + hum + windspeed
fit.lm = lm(lm.formula,data=train)
summary(fit.lm)
## 
## Call:
## lm(formula = lm.formula, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -357.98  -60.35   -7.81   50.80  436.42 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -87.367      7.970 -10.962  < 2e-16 ***
## season2       30.622     15.167   2.019 0.043509 *  
## season3        3.532     21.332   0.166 0.868512    
## season4       37.104     14.999   2.474 0.013383 *  
## yr1           87.418      1.859  47.022  < 2e-16 ***
## mnth2         11.189      4.568   2.449 0.014336 *  
## mnth3         29.450      4.939   5.963 2.55e-09 ***
## mnth4         24.124     15.883   1.519 0.128824    
## mnth5         47.844     15.993   2.991 0.002782 ** 
## mnth6         35.138     16.281   2.158 0.030926 *  
## mnth7         35.805     22.074   1.622 0.104820    
## mnth8         49.474     22.012   2.248 0.024623 *  
## mnth9         77.941     21.826   3.571 0.000357 ***
## mnth10        60.024     16.067   3.736 0.000188 ***
## mnth11        42.706     15.780   2.706 0.006812 ** 
## mnth12        38.384     15.071   2.547 0.010883 *  
## hr1          -17.267      6.337  -2.725 0.006441 ** 
## hr2          -27.562      6.357  -4.336 1.47e-05 ***
## hr3          -38.581      6.416  -6.013 1.88e-09 ***
## hr4          -39.780      6.396  -6.219 5.16e-10 ***
## hr5          -24.003      6.364  -3.772 0.000163 ***
## hr6           35.766      6.353   5.630 1.84e-08 ***
## hr7          170.904      6.344  26.938  < 2e-16 ***
## hr8          314.405      6.335  49.630  < 2e-16 ***
## hr9          166.689      6.341  26.287  < 2e-16 ***
## hr10         110.435      6.366  17.348  < 2e-16 ***
## hr11         137.322      6.412  21.417  < 2e-16 ***
## hr12         177.658      6.463  27.490  < 2e-16 ***
## hr13         173.189      6.516  26.580  < 2e-16 ***
## hr14         156.575      6.558  23.876  < 2e-16 ***
## hr15         167.026      6.569  25.426  < 2e-16 ***
## hr16         230.234      6.554  35.127  < 2e-16 ***
## hr17         387.221      6.517  59.413  < 2e-16 ***
## hr18         352.712      6.472  54.495  < 2e-16 ***
## hr19         240.526      6.408  37.538  < 2e-16 ***
## hr20         158.888      6.373  24.931  < 2e-16 ***
## hr21         108.764      6.346  17.138  < 2e-16 ***
## hr22          72.833      6.335  11.497  < 2e-16 ***
## hr23          33.563      6.330   5.302 1.17e-07 ***
## holiday1      -5.153      5.805  -0.888 0.374687    
## weekday1       4.933      3.568   1.382 0.166908    
## weekday2      10.852      3.447   3.149 0.001644 ** 
## weekday3      10.816      3.448   3.137 0.001711 ** 
## weekday4      11.964      3.441   3.477 0.000510 ***
## weekday5      18.493      3.448   5.364 8.30e-08 ***
## weekday6      20.338      3.420   5.946 2.82e-09 ***
## workingday1       NA         NA      NA       NA    
## weathersit2  -11.980      2.258  -5.306 1.14e-07 ***
## weathersit3  -65.555      3.806 -17.223  < 2e-16 ***
## weathersit4  -51.508     71.197  -0.723 0.469414    
## temp         108.484     33.430   3.245 0.001177 ** 
## atemp        110.132     34.130   3.227 0.001255 ** 
## hum          -80.389      6.680 -12.035  < 2e-16 ***
## windspeed    -33.401      8.380  -3.986 6.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 100.4 on 11981 degrees of freedom
## Multiple R-squared:  0.6942, Adjusted R-squared:  0.6929 
## F-statistic: 523.1 on 52 and 11981 DF,  p-value: < 2.2e-16
lm.cnt=predict(fit.lm, test)
test$lm.cnt = lm.cnt
lm.MSE = sum((test$cnt - test$lm.cnt)^2)/nrow(test)
lm.MSE
## [1] 11297.42
lm.result = ggplot(test,aes(cnt,lm.cnt))+geom_point()
lm.result

test$lm.cnt[test$lm.cnt < 0] = 0
lm.MSE = sum((test$cnt - test$lm.cnt)^2)/nrow(test)
lm.MSE
## [1] 10738.16
lm.result = ggplot(test,aes(cnt,lm.cnt))+geom_point()
lm.result
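
The summary above reports "1 not defined because of singularities" and shows NA for workingday1. The report does not explain this. A plausible reading, stated here as an assumption, is that workingday is fully determined by weekday and holiday, so its dummy column is a linear combination of columns already in the model. The toy example below (made-up data, hypothetical names toy and fit.toy) reproduces that behaviour.

```r
# Toy data: weekday coded 0..6 with 1..5 as Monday..Friday (assumed coding)
weekday <- c(0, 1, 2, 3, 4, 5, 6, 1, 3, 5)
holiday <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 1)

# workingday is 1 only for non-holiday weekdays, as described in the data notes
workingday <- as.integer(weekday %in% 1:5 & holiday == 0)

set.seed(1)
toy <- data.frame(
  y          = rnorm(length(weekday)),  # arbitrary response
  holiday    = factor(holiday),
  weekday    = factor(weekday),
  workingday = factor(workingday)
)

fit.toy <- lm(y ~ holiday + weekday + workingday, data = toy)

# workingday1 equals (weekday1 + ... + weekday5) - holiday1 in this toy data,
# so lm() cannot estimate it separately and reports NA
coef(fit.toy)["workingday1"]
```

If this is indeed the cause, dropping either workingday or weekday from the linear model should remove the NA without changing the fitted values.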

Step2: Separate models for registered and casual

# Model for Registered
# Windspeed and weekday are taken out
lm.registered.formula = registered ~ season + yr + mnth + hr + holiday  + 
    workingday + weathersit + temp + atemp + hum
lm.model.registered = lm(lm.registered.formula,data=train)
lm.registered = predict(lm.model.registered,test)
test$lm.registered = lm.registered
lm.registered.MSE = sum((test$lm.registered-test$registered)^2)/nrow(test)
lm.registered.MSE
## [1] 8157.388
lm.registered.result = ggplot(test,aes(registered,lm.registered))+geom_point()
lm.registered.result

test$lm.registered[test$lm.registered < 0] = 0
lm.registered.MSE = sum((test$lm.registered-test$registered)^2)/nrow(test)
lm.registered.MSE
## [1] 7763.378
lm.registered.result = ggplot(test,aes(registered,lm.registered))+geom_point()
lm.registered.result

# Model for Casual
# Windspeed is taken out
lm.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum 
lm.model.casual = lm(lm.casual.formula, data = train)
lm.casual = predict(lm.model.casual,test)
test$lm.casual = lm.casual
lm.casual.MSE = sum((test$lm.casual-test$casual)^2)/nrow(test)
lm.casual.MSE
## [1] 998.3204
lm.casual.result = ggplot(test,aes(casual,lm.casual))+geom_point()
lm.casual.result

test$lm.casual[test$lm.casual < 0] = 0
lm.casual.MSE = sum((test$lm.casual-test$casual)^2)/nrow(test)
lm.casual.MSE
## [1] 913.7805
lm.casual.result = ggplot(test,aes(casual,lm.casual))+geom_point()
lm.casual.result

Step3: Now combine the predicted registered users and casual users

test$lm.combined = test$lm.casual + test$lm.registered
lm.combined.MSE = sum((test$cnt-test$lm.combined)^2)/nrow(test)
lm.combined.MSE
## [1] 10684.49

(3) SVM

Step1: Original model

# Original Model
svm.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
svm.model = svm(svm.formula, data = train)
svm.cnt=predict(svm.model, test)
test$svm.cnt = svm.cnt
svm.MSE = sum((test$cnt - test$svm.cnt)^2)/nrow(test)
svm.MSE
## [1] 7546.31
svm.result = ggplot(test,aes(cnt,svm.cnt))+geom_point()
svm.result

test$svm.cnt[test$svm.cnt < 0] = 0
svm.MSE = sum((test$cnt - test$svm.cnt)^2)/nrow(test)
svm.MSE
## [1] 7511.831
svm.result = ggplot(test,aes(cnt,svm.cnt))+geom_point()
svm.result

Step2: Separate models for registered and casual

# Model for Registered
svm.registered.formula = registered ~ season + yr + mnth + hr + weekday + workingday + temp + atemp + hum 
svm.model.registered = svm(svm.registered.formula, data = train)
svm.registered = predict(svm.model.registered,test)
test$svm.registered = svm.registered
svm.registered.MSE = sum((test$svm.registered-test$registered)^2)/nrow(test)
svm.registered.MSE
## [1] 5870.63
svm.registered.result = ggplot(test,aes(registered,svm.registered))+geom_point()
svm.registered.result

test$svm.registered[test$svm.registered < 0] = 0
svm.registered.MSE = sum((test$svm.registered-test$registered)^2)/nrow(test)
svm.registered.MSE
## [1] 5854.001
svm.registered.result = ggplot(test,aes(registered,svm.registered))+geom_point()
svm.registered.result

# Model for Casual
svm.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
svm.model.casual = svm(svm.casual.formula, data = train)
svm.casual = predict(svm.model.casual,test)
test$svm.casual = svm.casual
svm.casual.MSE = sum((test$svm.casual-test$casual)^2)/nrow(test)
svm.casual.MSE
## [1] 650.8312
svm.casual.result = ggplot(test,aes(casual,svm.casual))+geom_point()
svm.casual.result

test$svm.casual[test$svm.casual < 0] = 0
svm.casual.MSE = sum((test$svm.casual-test$casual)^2)/nrow(test)
svm.casual.MSE
## [1] 647.9083
svm.casual.result = ggplot(test,aes(casual,svm.casual))+geom_point()
svm.casual.result

Step3: Now combine the predicted registered users and casual users

test$svm.combined = test$svm.casual + test$svm.registered
svm.combined.MSE = sum((test$cnt-test$svm.combined)^2)/nrow(test)
svm.combined.MSE
## [1] 7655.972

(4) Random Forest

Step1: Original model

# Original Model
forest.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
forest.model = randomForest(forest.formula, data = train, importance = TRUE, ntree = 200)
forest.model$importance
##                %IncMSE IncNodePurity
## season      2011.56930     9278210.6
## yr          6192.83240    30285109.1
## mnth        3056.79334    17557971.9
## hr         38472.46107   209246414.1
## holiday       92.64652      880980.2
## weekday     3876.14948    17131864.1
## workingday  4389.96272    13304939.1
## weathersit   872.92779     6161000.9
## temp        4525.47739    25254204.6
## atemp       4557.41799    30696815.7
## hum         3484.10188    22285222.4
## windspeed    355.81994     6326822.3
forest.cnt = predict(forest.model,test)
test$forest.cnt = forest.cnt
forest.MSE = sum((test$cnt - test$forest.cnt)^2)/nrow(test)
forest.MSE
## [1] 3798.602
forest.result = ggplot(test,aes(cnt,forest.cnt))+geom_point()
forest.result

test$forest.cnt[test$forest.cnt < 0] = 0
forest.MSE = sum((test$cnt - test$forest.cnt)^2)/nrow(test)
forest.MSE
## [1] 3798.602
forest.result = ggplot(test,aes(cnt,forest.cnt))+geom_point()
forest.result

Step2: Separate models for registered and casual

# Model for Registered
forest.registered.formula = registered ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
forest.model.registered = randomForest(forest.registered.formula, data = train, importance = TRUE, ntree = 200)
forest.model.registered$importance
##                %IncMSE IncNodePurity
## season      1408.11543     6510456.6
## yr          4731.70414    22483109.8
## mnth        2138.89288    11764156.9
## hr         29744.78947   153690889.0
## holiday       68.59201      646385.2
## weekday     3138.03807    14667232.8
## workingday  4044.50293    13589716.1
## weathersit   646.85498     4381252.9
## temp        2221.16143    10979597.2
## atemp       2409.19034    14759622.3
## hum         1839.72849    12772305.1
## windspeed    232.38954     4118188.1
forest.registered = predict(forest.model.registered,test)
test$forest.registered = forest.registered
forest.registered.MSE = sum((test$forest.registered-test$registered)^2)/nrow(test)
forest.registered.MSE
## [1] 2619.906
forest.registered.result = ggplot(test,aes(registered,forest.registered))+geom_point()
forest.registered.result

test$forest.registered[test$forest.registered < 0] = 0
forest.registered.MSE = sum((test$forest.registered-test$registered)^2)/nrow(test)
forest.registered.MSE
## [1] 2619.906
forest.registered.result = ggplot(test,aes(registered,forest.registered))+geom_point()
forest.registered.result

# Model for Casual
forest.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
forest.model.casual = randomForest(forest.casual.formula, data = train, importance = TRUE, ntree = 160)
forest.model.casual$importance
##               %IncMSE IncNodePurity
## season      146.91900      520619.5
## yr          237.04584      994050.4
## mnth        354.25313     1958205.0
## hr         1880.66235     9826946.5
## holiday      29.69380      171391.2
## weekday     753.85289     2932163.9
## workingday  841.01505     3023545.2
## weathersit   48.05173      316445.6
## temp        579.24601     3081440.2
## atemp       668.77214     4066669.9
## hum         311.39555     2149306.1
## windspeed    33.85808      528132.4
forest.casual = predict(forest.model.casual,test)
test$forest.casual = forest.casual
forest.casual.MSE = sum((test$forest.casual-test$casual)^2)/nrow(test)
forest.casual.MSE
## [1] 379.7289
forest.casual.result = ggplot(test,aes(casual,forest.casual))+geom_point()
forest.casual.result

test$forest.casual[test$forest.casual < 0] = 0
forest.casual.MSE = sum((test$forest.casual-test$casual)^2)/nrow(test)
forest.casual.MSE
## [1] 379.7289
forest.casual.result = ggplot(test,aes(casual,forest.casual))+geom_point()
forest.casual.result

Step3: Now combine the predicted registered users and casual users

test$forest.combined = test$forest.casual + test$forest.registered
forest.combined.MSE = sum((test$cnt-test$forest.combined)^2)/nrow(test)
forest.combined.MSE
## [1] 3515.571

(5) GBM

Step1: Original model

# Original Model
gbm.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
gbm.model = gbm(gbm.formula, data=train, n.trees=1000, distribution="gaussian", interaction.depth=5, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
summary(gbm.model)

##                   var    rel.inf
## hr                 hr 59.5640798
## yr                 yr 10.1021434
## workingday workingday  7.3482236
## mnth             mnth  5.7626710
## temp             temp  5.0394995
## weekday       weekday  3.5016279
## atemp           atemp  2.5713378
## season         season  2.0316668
## weathersit weathersit  1.9001772
## hum               hum  1.6908947
## windspeed   windspeed  0.3362276
## holiday       holiday  0.1514507
perf.trees = gbm.perf(gbm.model)
## Using OOB method...

gbm.cnt = predict(gbm.model, newdata=test, n.trees=perf.trees)
test$gbm.cnt = gbm.cnt
gbm.MSE = sum((test$cnt - test$gbm.cnt)^2)/nrow(test)
gbm.MSE
## [1] 3572.765
gbm.result = ggplot(test,aes(cnt,gbm.cnt))+geom_point()
gbm.result

test$gbm.cnt[test$gbm.cnt < 0] = 0
gbm.MSE = sum((test$cnt - test$gbm.cnt)^2)/nrow(test)
gbm.MSE
## [1] 3556.995
gbm.result = ggplot(test,aes(cnt,gbm.cnt))+geom_point()
gbm.result

Step2: Separate models for registered and casual

# Model for Registered
# Taking out holiday, atemp, hum and windspeed gives the best result
gbm.registered.formula = registered ~ season + yr + mnth + hr + weekday + workingday+ weathersit + temp
gbm.model.registered = gbm(gbm.registered.formula, data=train, n.trees=1000, distribution="gaussian", interaction.depth=5, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
summary(gbm.model.registered)

##                   var   rel.inf
## hr                 hr 59.422886
## workingday workingday 11.098504
## yr                 yr 10.946578
## mnth             mnth  5.802469
## weekday       weekday  4.369740
## temp             temp  3.823567
## weathersit weathersit  2.524691
## season         season  2.011565
perf.trees.registered = gbm.perf(gbm.model.registered)
## Using OOB method...

gbm.registered = predict(gbm.model.registered,newdata=test,n.trees = perf.trees.registered)
test$gbm.registered = gbm.registered
gbm.registered.MSE = sum((test$gbm.registered-test$registered)^2)/nrow(test)
gbm.registered.MSE
## [1] 2642.751
gbm.registered.result = ggplot(test,aes(registered,gbm.registered))+geom_point()
gbm.registered.result

test$gbm.registered[test$gbm.registered < 0] = 0
gbm.registered.MSE = sum((test$gbm.registered-test$registered)^2)/nrow(test)
gbm.registered.MSE
## [1] 2636.727
gbm.registered.result = ggplot(test,aes(registered,gbm.registered))+geom_point()
gbm.registered.result

# Model for Casual
gbm.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
gbm.model.casual = gbm(gbm.casual.formula, data=train, n.trees=1000, distribution="gaussian", interaction.depth=5, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
summary(gbm.model.casual)

##                   var    rel.inf
## hr                 hr 34.9461746
## weekday       weekday 12.5017985
## atemp           atemp 11.8526189
## temp             temp 11.7052478
## workingday workingday 11.1590400
## mnth             mnth  7.2706955
## hum               hum  4.1731473
## yr                 yr  4.0130881
## windspeed   windspeed  0.8472661
## weathersit weathersit  0.8341478
## holiday       holiday  0.4924107
## season         season  0.2043647
perf.trees.casual = gbm.perf(gbm.model.casual)
## Using OOB method...

gbm.casual = predict(gbm.model.casual,newdata=test,n.trees = perf.trees.casual)
test$gbm.casual = gbm.casual
gbm.casual.MSE = sum((test$gbm.casual-test$casual)^2)/nrow(test)
gbm.casual.MSE
## [1] 344.1961
gbm.casual.result = ggplot(test,aes(casual,gbm.casual))+geom_point()
gbm.casual.result

test$gbm.casual[test$gbm.casual < 0] = 0
gbm.casual.MSE = sum((test$gbm.casual-test$casual)^2)/nrow(test)
gbm.casual.MSE
## [1] 343.1621
gbm.casual.result = ggplot(test,aes(casual,gbm.casual))+geom_point()
gbm.casual.result

Step3: Now combine the predicted registered users and casual users

test$gbm.combined = test$gbm.casual + test$gbm.registered
gbm.combined.MSE = sum((test$cnt-test$gbm.combined)^2)/nrow(test)
gbm.combined.MSE
## [1] 3369.337
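
The test-set MSE values printed so far for models (2) to (5) can be gathered into one table to compare the single cnt model (after setting negative predictions to 0) with the sum of the registered and casual models. The sketch below only re-enters numbers already shown above. The names results, single, combined and improvement are illustrative.

```r
# MSE values copied from the output above
results <- data.frame(
  model    = c("Linear Regression", "SVM", "Random Forest", "GBM"),
  single   = c(10738.16, 7511.831, 3798.602, 3556.995),   # one model for cnt
  combined = c(10684.49, 7655.972, 3515.571, 3369.337)    # registered + casual
)

# Positive values mean the combined approach has the lower MSE
results$improvement <- results$single - results$combined

# Sort from best to worst combined MSE
results[order(results$combined), ]
```

On these figures the combined approach lowers the MSE for linear regression, random forest and GBM, but not for SVM, and GBM has the lowest MSE of the four.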

(6) Regression Tree

Step1: Original model

# Original Model
formula.cnt <- cnt ~  season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed

fit.rpart.cnt <- rpart(formula.cnt, method="anova", data= train)
print(summary(fit.rpart.cnt))
## Call:
## rpart(formula = formula.cnt, data = train, method = "anova")
##   n= 12034 
## 
##            CP nsplit rel error    xerror        xstd
## 1  0.36074746      0 1.0000000 1.0001864 0.016576271
## 2  0.10238796      1 0.6392525 0.6413939 0.012165579
## 3  0.05755831      2 0.5368646 0.5461442 0.009611061
## 4  0.03971267      3 0.4793063 0.4701578 0.008041499
## 5  0.03050477      4 0.4395936 0.4250503 0.007356162
## 6  0.02638756      5 0.4090888 0.3971615 0.006791662
## 7  0.02181266      6 0.3827013 0.3758256 0.006298499
## 8  0.02006172      7 0.3608886 0.3553995 0.006155750
## 9  0.01945226      8 0.3408269 0.3468612 0.006001831
## 10 0.01611249     10 0.3019223 0.3217805 0.005539678
## 11 0.01412510     11 0.2858099 0.2965783 0.005277021
## 12 0.01332590     12 0.2716848 0.2826774 0.005268221
## 13 0.01123726     13 0.2583589 0.2667504 0.005026355
## 14 0.01000000     14 0.2471216 0.2596578 0.004909933
## 
## Variable importance
##         hr         yr       mnth       temp      atemp        hum 
##         51          9          7          7          7          5 
##     season workingday    weekday  windspeed 
##          5          4          4          1 
## 
## Node number 1: 12034 observations,    complexity param=0.3607475
##   mean=191.5313, MSE=32814.64 
##   left son=2 (4479 obs) right son=3 (7555 obs)
##   Primary splits:
##       hr    splits as  LLLLLLLRRRRRRRRRRRRRRRLL, improve=0.36074750, (0 missing)
##       atemp < 0.5985  to the left,  improve=0.12935650, (0 missing)
##       temp  < 0.55    to the left,  improve=0.10419890, (0 missing)
##       hum   < 0.665   to the right, improve=0.07209467, (0 missing)
##       yr    splits as  LR, improve=0.06667235, (0 missing)
##   Surrogate splits:
##       hum       < 0.765   to the right, agree=0.663, adj=0.094, (0 split)
##       windspeed < 0.09705 to the left,  agree=0.632, adj=0.013, (0 split)
##       temp      < 0.15    to the left,  agree=0.630, adj=0.007, (0 split)
##       atemp     < 0.08335 to the left,  agree=0.629, adj=0.003, (0 split)
## 
## Node number 2: 4479 observations,    complexity param=0.0141251
##   mean=50.22483, MSE=3239.629 
##   left son=4 (2968 obs) right son=5 (1511 obs)
##   Primary splits:
##       hr     splits as  LLLLLLR---------------RR, improve=0.38440840, (0 missing)
##       temp   < 0.47    to the left,  improve=0.06934414, (0 missing)
##       atemp  < 0.4621  to the left,  improve=0.06834863, (0 missing)
##       mnth   splits as  LLLLRRRRRRLL, improve=0.05825211, (0 missing)
##       season splits as  LRRR, improve=0.05093041, (0 missing)
##   Surrogate splits:
##       temp  < 0.77    to the left,  agree=0.666, adj=0.009, (0 split)
##       atemp < 0.73485 to the left,  agree=0.665, adj=0.008, (0 split)
##       hum   < 0.11    to the right, agree=0.663, adj=0.001, (0 split)
## 
## Node number 3: 7555 observations,    complexity param=0.102388
##   mean=275.3052, MSE=31492.39 
##   left son=6 (5036 obs) right son=7 (2519 obs)
##   Primary splits:
##       hr     splits as  -------LRLLLLLLLRRRRLL--, improve=0.1699364, (0 missing)
##       yr     splits as  LR, improve=0.1505464, (0 missing)
##       temp   < 0.49    to the left,  improve=0.1391370, (0 missing)
##       atemp  < 0.47725 to the left,  improve=0.1364355, (0 missing)
##       season splits as  LRRR, improve=0.1233174, (0 missing)
##   Surrogate splits:
##       windspeed  < 0.7985  to the left,  agree=0.667, adj=0.001, (0 split)
##       weathersit splits as  LLLR,        agree=0.667, adj=0.000, (0 split)
##       atemp      < 0.9015  to the left,  agree=0.667, adj=0.000, (0 split)
## 
## Node number 4: 2968 observations
##   mean=25.04549, MSE=933.8123 
## 
## Node number 5: 1511 observations
##   mean=99.68365, MSE=4077.341 
## 
## Node number 6: 5036 observations,    complexity param=0.03971267
##   mean=223.5663, MSE=17665.7 
##   left son=12 (2516 obs) right son=13 (2520 obs)
##   Primary splits:
##       yr     splits as  LR,           improve=0.1762747, (0 missing)
##       temp   < 0.45    to the left,   improve=0.1504354, (0 missing)
##       atemp  < 0.44695 to the left,   improve=0.1476129, (0 missing)
##       season splits as  LRRR,         improve=0.1474771, (0 missing)
##       mnth   splits as  LLLRRRRRRRRR, improve=0.1461857, (0 missing)
##   Surrogate splits:
##       hum        < 0.505   to the right, agree=0.535, adj=0.070, (0 split)
##       temp       < 0.55    to the left,  agree=0.529, adj=0.057, (0 split)
##       atemp      < 0.52275 to the left,  agree=0.528, adj=0.056, (0 split)
##       windspeed  < 0.26865 to the right, agree=0.527, adj=0.053, (0 split)
##       weathersit splits as  RLL-,        agree=0.512, adj=0.024, (0 split)
## 
## Node number 7: 2519 observations,    complexity param=0.05755831
##   mean=378.742, MSE=43083.91 
##   left son=14 (1259 obs) right son=15 (1260 obs)
##   Primary splits:
##       yr     splits as  LR,           improve=0.2094317, (0 missing)
##       temp   < 0.49    to the left,   improve=0.1993776, (0 missing)
##       atemp  < 0.47725 to the left,   improve=0.1949973, (0 missing)
##       season splits as  LRRR,         improve=0.1673958, (0 missing)
##       mnth   splits as  LLLRRRRRRRRR, improve=0.1626678, (0 missing)
##   Surrogate splits:
##       hum        < 0.445   to the right, agree=0.554, adj=0.107, (0 split)
##       atemp      < 0.58335 to the left,  agree=0.533, adj=0.066, (0 split)
##       temp       < 0.59    to the left,  agree=0.532, adj=0.064, (0 split)
##       windspeed  < 0.31345 to the right, agree=0.515, adj=0.030, (0 split)
##       weathersit splits as  RRLR,        agree=0.509, adj=0.018, (0 split)
## 
## Node number 12: 2516 observations,    complexity param=0.01611249
##   mean=167.7186, MSE=9589.757 
##   left son=24 (836 obs) right son=25 (1680 obs)
##   Primary splits:
##       mnth       splits as  LLLLRRRRRRRR, improve=0.2637072, (0 missing)
##       season     splits as  LRRR,         improve=0.2394119, (0 missing)
##       atemp      < 0.4621  to the left,   improve=0.2192312, (0 missing)
##       temp       < 0.47    to the left,   improve=0.2192312, (0 missing)
##       workingday splits as  RL,           improve=0.0625818, (0 missing)
##   Surrogate splits:
##       season    splits as  LRRR,        agree=0.909, adj=0.725, (0 split)
##       atemp     < 0.35605 to the left,  agree=0.793, adj=0.378, (0 split)
##       temp      < 0.39    to the left,  agree=0.791, adj=0.372, (0 split)
##       hum       < 0.405   to the left,  agree=0.688, adj=0.060, (0 split)
##       windspeed < 0.37315 to the right, agree=0.687, adj=0.057, (0 split)
## 
## Node number 13: 2520 observations,    complexity param=0.02006172
##   mean=279.3254, MSE=19505.74 
##   left son=26 (721 obs) right son=27 (1799 obs)
##   Primary splits:
##       temp    < 0.39    to the left,   improve=0.1611695, (0 missing)
##       atemp   < 0.4015  to the left,   improve=0.1571376, (0 missing)
##       mnth    splits as  LLRRRRRRRRRR, improve=0.1559454, (0 missing)
##       season  splits as  LRRR,         improve=0.1512278, (0 missing)
##       weekday splits as  RLLLLLR,      improve=0.0617536, (0 missing)
##   Surrogate splits:
##       atemp  < 0.4015  to the left,   agree=0.996, adj=0.986, (0 split)
##       mnth   splits as  LLRRRRRRRRLL, agree=0.873, adj=0.558, (0 split)
##       season splits as  LRRR,         agree=0.803, adj=0.311, (0 split)
##       hum    < 0.91    to the right,  agree=0.715, adj=0.004, (0 split)
## 
## Node number 14: 1259 observations,    complexity param=0.02181266
##   mean=283.7141, MSE=22177.25 
##   left son=28 (524 obs) right son=29 (735 obs)
##   Primary splits:
##       mnth       splits as  LLLLRRRRRRRL, improve=0.3084984, (0 missing)
##       season     splits as  LRRR,         improve=0.3003280, (0 missing)
##       temp       < 0.47    to the left,   improve=0.2964665, (0 missing)
##       atemp      < 0.4621  to the left,   improve=0.2964665, (0 missing)
##       workingday splits as  LR,           improve=0.1184052, (0 missing)
##   Surrogate splits:
##       temp      < 0.47    to the left,  agree=0.841, adj=0.618, (0 split)
##       atemp     < 0.4621  to the left,  agree=0.841, adj=0.618, (0 split)
##       season    splits as  LRRR,        agree=0.833, adj=0.599, (0 split)
##       hum       < 0.385   to the left,  agree=0.626, adj=0.101, (0 split)
##       windspeed < 0.31345 to the right, agree=0.622, adj=0.092, (0 split)
## 
## Node number 15: 1260 observations,    complexity param=0.03050477
##   mean=473.6944, MSE=45934.88 
##   left son=30 (400 obs) right son=31 (860 obs)
##   Primary splits:
##       workingday splits as  LR,           improve=0.2081288, (0 missing)
##       temp       < 0.43    to the left,   improve=0.1949022, (0 missing)
##       atemp      < 0.4318  to the left,   improve=0.1887979, (0 missing)
##       mnth       splits as  LLRRRRRRRRRR, improve=0.1842011, (0 missing)
##       weekday    splits as  LRRRRRL,      improve=0.1820905, (0 missing)
##   Surrogate splits:
##       weekday splits as  LRRRRRL,     agree=0.968, adj=0.900, (0 split)
##       holiday splits as  RL,          agree=0.714, adj=0.100, (0 split)
##       atemp   < 0.1894  to the left,  agree=0.689, adj=0.020, (0 split)
##       hum     < 0.195   to the left,  agree=0.686, adj=0.010, (0 split)
##       temp    < 0.93    to the right, agree=0.685, adj=0.007, (0 split)
## 
## Node number 24: 836 observations
##   mean=96.43062, MSE=3492.994 
## 
## Node number 25: 1680 observations
##   mean=203.1929, MSE=8836.312 
## 
## Node number 26: 721 observations
##   mean=190.7587, MSE=9708.069 
## 
## Node number 27: 1799 observations,    complexity param=0.01945226
##   mean=314.821, MSE=19028.77 
##   left son=54 (1240 obs) right son=55 (559 obs)
##   Primary splits:
##       workingday splits as  RL, improve=0.13644570, (0 missing)
##       weekday    splits as  RLLLLLR, improve=0.13176200, (0 missing)
##       hr         splits as  -------R-RLRRRRR----RL--, improve=0.05825584, (0 missing)
##       weathersit splits as  RRL-, improve=0.05455847, (0 missing)
##       atemp      < 0.61365 to the left,  improve=0.05099518, (0 missing)
##   Surrogate splits:
##       weekday splits as  RLLLLLR,     agree=0.974, adj=0.916, (0 split)
##       holiday splits as  LR,          agree=0.715, adj=0.084, (0 split)
##       atemp   < 0.85605 to the left,  agree=0.694, adj=0.016, (0 split)
##       temp    < 0.95    to the left,  agree=0.692, adj=0.009, (0 split)
##       hum     < 0.205   to the right, agree=0.690, adj=0.004, (0 split)
## 
## Node number 28: 524 observations
##   mean=185.7519, MSE=11624.71 
## 
## Node number 29: 735 observations
##   mean=353.5537, MSE=17981.19 
## 
## Node number 30: 400 observations,    complexity param=0.01123726
##   mean=330.325, MSE=32555.74 
##   left son=60 (164 obs) right son=61 (236 obs)
##   Primary splits:
##       temp   < 0.49    to the left,  improve=0.3407615, (0 missing)
##       atemp  < 0.47725 to the left,  improve=0.3407615, (0 missing)
##       hr     splits as  --------L-------RRRL----, improve=0.2724685, (0 missing)
##       mnth   splits as  LLRRRRRRRRRL, improve=0.2545665, (0 missing)
##       season splits as  LRRR, improve=0.1791564, (0 missing)
##   Surrogate splits:
##       atemp      < 0.47725 to the left,   agree=1.000, adj=1.000, (0 split)
##       season     splits as  LRRL,         agree=0.878, adj=0.701, (0 split)
##       mnth       splits as  LLRRRRRRRLLL, agree=0.878, adj=0.701, (0 split)
##       hum        < 0.745   to the right,  agree=0.638, adj=0.116, (0 split)
##       weathersit splits as  RRL-,         agree=0.612, adj=0.055, (0 split)
## 
## Node number 31: 860 observations,    complexity param=0.02638756
##   mean=540.3779, MSE=38150.68 
##   left son=62 (344 obs) right son=63 (516 obs)
##   Primary splits:
##       hr     splits as  --------R-------LRRL----, improve=0.3175968, (0 missing)
##       mnth   splits as  LLLRRRRRRRLL, improve=0.2233810, (0 missing)
##       season splits as  LRRR, improve=0.2183222, (0 missing)
##       temp   < 0.49    to the left,  improve=0.1973805, (0 missing)
##       atemp  < 0.47725 to the left,  improve=0.1920992, (0 missing)
##   Surrogate splits:
##       temp      < 0.87    to the right, agree=0.602, adj=0.006, (0 split)
##       atemp     < 0.75    to the right, agree=0.602, adj=0.006, (0 split)
##       hum       < 0.215   to the left,  agree=0.601, adj=0.003, (0 split)
##       windspeed < 0.56715 to the right, agree=0.601, adj=0.003, (0 split)
## 
## Node number 54: 1240 observations
##   mean=280.6089, MSE=8820.401 
## 
## Node number 55: 559 observations,    complexity param=0.01945226
##   mean=390.712, MSE=33317.61 
##   left son=110 (204 obs) right son=111 (355 obs)
##   Primary splits:
##       hr        splits as  -------L-LRRRRRR----LL--, improve=0.57408930, (0 missing)
##       hum       < 0.575   to the right, improve=0.24107720, (0 missing)
##       atemp     < 0.61365 to the left,  improve=0.10204650, (0 missing)
##       temp      < 0.55    to the left,  improve=0.06584068, (0 missing)
##       windspeed < 0.31345 to the left,  improve=0.04022250, (0 missing)
##   Surrogate splits:
##       hum < 0.715   to the right, agree=0.696, adj=0.167, (0 split)
## 
## Node number 60: 164 observations
##   mean=203.9756, MSE=14611.74 
## 
## Node number 61: 236 observations
##   mean=418.1271, MSE=26222.35 
## 
## Node number 62: 344 observations
##   mean=405.564, MSE=17641.27 
## 
## Node number 63: 516 observations,    complexity param=0.0133259
##   mean=630.2539, MSE=31629.39 
##   left son=126 (213 obs) right son=127 (303 obs)
##   Primary splits:
##       mnth       splits as  LLLRRRRRRRLL, improve=0.3224286, (0 missing)
##       temp       < 0.49    to the left,   improve=0.3217399, (0 missing)
##       season     splits as  LRRR,         improve=0.3196353, (0 missing)
##       atemp      < 0.47725 to the left,   improve=0.3146083, (0 missing)
##       weathersit splits as  RRLL,         improve=0.1475443, (0 missing)
##   Surrogate splits:
##       season    splits as  LRRL,        agree=0.913, adj=0.789, (0 split)
##       temp      < 0.49    to the left,  agree=0.888, adj=0.728, (0 split)
##       atemp     < 0.47725 to the left,  agree=0.882, adj=0.714, (0 split)
##       hum       < 0.92    to the right, agree=0.597, adj=0.023, (0 split)
##       windspeed < 0.43285 to the right, agree=0.597, adj=0.023, (0 split)
## 
## Node number 110: 204 observations
##   mean=208.2696, MSE=11069.36 
## 
## Node number 111: 355 observations
##   mean=495.5521, MSE=15983.78 
## 
## Node number 126: 213 observations
##   mean=509.8075, MSE=21483.6 
## 
## Node number 127: 303 observations
##   mean=714.9241, MSE=21394.31 
## 
## n= 12034 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 12034 394891300 191.53130  
##     2) hr=0,1,2,3,4,5,6,22,23 4479  14510300  50.22483  
##       4) hr=0,1,2,3,4,5 2968   2771555  25.04549 *
##       5) hr=6,22,23 1511   6160863  99.68365 *
##     3) hr=7,8,9,10,11,12,13,14,15,16,17,18,19,20,21 7555 237925000 275.30520  
##       6) hr=7,9,10,11,12,13,14,15,20,21 5036  88964490 223.56630  
##        12) yr=0 2516  24127830 167.71860  
##          24) mnth=1,2,3,4 836   2920143  96.43062 *
##          25) mnth=5,6,7,8,9,10,11,12 1680  14845000 203.19290 *
##        13) yr=1 2520  49154470 279.32540  
##          26) temp< 0.39 721   6999518 190.75870 *
##          27) temp>=0.39 1799  34232750 314.82100  
##            54) workingday=1 1240  10937300 280.60890 *
##            55) workingday=0 559  18624540 390.71200  
##             110) hr=7,9,20,21 204   2258150 208.26960 *
##             111) hr=10,11,12,13,14,15 355   5674242 495.55210 *
##       7) hr=8,16,17,18,19 2519 108528400 378.74200  
##        14) yr=0 1259  27921150 283.71410  
##          28) mnth=1,2,3,4,12 524   6091350 185.75190 *
##          29) mnth=5,6,7,8,9,10,11 735  13216170 353.55370 *
##        15) yr=1 1260  57877950 473.69440  
##          30) workingday=0 400  13022300 330.32500  
##            60) temp< 0.49 164   2396326 203.97560 *
##            61) temp>=0.49 236   6188474 418.12710 *
##          31) workingday=1 860  32809580 540.37790  
##            62) hr=16,19 344   6068599 405.56400 *
##            63) hr=8,17,18 516  16320770 630.25390  
##             126) mnth=1,2,3,11,12 213   4576007 509.80750 *
##             127) mnth=4,5,6,7,8,9,10 303   6482477 714.92410 *
# Access variable importance
fit.rpart.cnt$variable.importance
##         hr         yr       mnth       temp      atemp        hum 
##  209578416   38411469   27767349   27500369   27021271   20736191 
##     season workingday    weekday  windspeed    holiday weathersit 
##   19497245   16716978   15119647    4642626    1597331    1048780
# Validate the fit.rpart model using testing data 
test.cnt <- test[, "cnt"]
test.x <- test[, 3:14]
rpart.cnt <- predict(fit.rpart.cnt, test.x)
test$rpart.cnt <- rpart.cnt
rpart.MSE = mean((rpart.cnt - test.cnt)^2)
rpart.MSE
## [1] 9523.749
rpart.result = ggplot(test,aes(cnt,rpart.cnt))+geom_point()
rpart.result

Step2: Separate models for registered and casual

# Model for Registered
formula.registered <- registered ~  season + yr + mnth + hr + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed
fit.rpart.registered <- rpart(formula.registered, method="anova", data= train)

test.registered <- test[, "registered"]
rpart.registered <- predict(fit.rpart.registered, test.x)
test$rpart.registered <- rpart.registered
rpart.registered.MSE = mean((rpart.registered - test.registered)^2)
rpart.registered.MSE
## [1] 7096.629
rpart.registered.result = ggplot(test,aes(registered,rpart.registered))+geom_point()
rpart.registered.result

# Model for Casual
formula.casual <- casual ~ season + yr + mnth + hr + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed

fit.rpart.casual <-  rpart(formula.casual, method="anova", data= train)

test.casual <- test[, "casual"]
rpart.casual <- predict(fit.rpart.casual, test.x)
test$rpart.casual <- rpart.casual 
rpart.casual.MSE = mean((rpart.casual - test.casual)^2)
rpart.casual.MSE
## [1] 661.3148
rpart.casual.result = ggplot(test,aes(casual,rpart.casual))+geom_point()
rpart.casual.result

Step3: Now combine the predicted registered users and casual users

rpart.combined = rpart.casual + rpart.registered
test$rpart.combined <- rpart.combined
rpart.combined.MSE = sum((test$cnt-test$rpart.combined)^2)/nrow(test)
rpart.combined.MSE
## [1] 8568.5
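
A note on reading the combined MSE: it is not simply the sum of the registered and casual MSEs. Writing e_r and e_c for the two prediction errors, the combined squared error expands to MSE_r + MSE_c + 2 * mean(e_r * e_c). For the regression tree, 8568.5 - (7096.629 + 661.3148) is about 810.6, so the errors of the two models tend to have the same sign on the test set. The sketch below checks the identity on simulated data; every variable name and number in it is an illustrative assumption, not a value taken from hour.csv.

```r
# Simulated stand-ins for the test set (illustrative only)
set.seed(1)
n <- 200
registered <- rpois(n, 150)
casual <- rpois(n, 35)
cnt <- registered + casual

# Simulated predictions with some noise
pred.registered <- registered + rnorm(n, sd = 40)
pred.casual <- casual + rnorm(n, sd = 10)

# Per-model errors
e.r <- pred.registered - registered
e.c <- pred.casual - casual

# MSE of the summed prediction, computed directly
mse.combined <- mean((pred.registered + pred.casual - cnt)^2)

# The same quantity from the two MSEs plus the cross term
decomposed <- mean(e.r^2) + mean(e.c^2) + 2 * mean(e.r * e.c)

all.equal(mse.combined, decomposed)
```

The identity holds exactly only when no predictions are clamped after the fact; once negative values are set to 0, the errors entering each term change accordingly.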

Now let’s compare the results of all the models in a single comparison table.

rpart.MSEs <- c(rpart.MSE, rpart.combined.MSE, rpart.registered.MSE, rpart.casual.MSE )
rpart.MSEs <- matrix(rpart.MSEs, nrow= 1, ncol=4) 
colnames(rpart.MSEs) <-c("cnt.MSE", "combined.MSE", "registered.MSE", "casual.MSE" )
rownames(rpart.MSEs) <- "rpart.MSEs"
lm.MSEs = c(lm.MSE, lm.combined.MSE, lm.registered.MSE, lm.casual.MSE)
forest.MSEs =  c(forest.MSE, forest.combined.MSE, forest.registered.MSE, forest.casual.MSE )
svm.MSEs = c(svm.MSE, svm.combined.MSE, svm.registered.MSE, svm.casual.MSE)
neural.MSEs = c(neural.MSE, neural.combined.MSE, neural.registered.MSE, neural.casual.MSE)
gbm.MSEs = c(gbm.MSE, gbm.combined.MSE, gbm.registered.MSE, gbm.casual.MSE)
Summary.MSE=rbind(rpart.MSEs, forest.MSEs, svm.MSEs, neural.MSEs, gbm.MSEs, lm.MSEs)
Summary.MSE
##               cnt.MSE combined.MSE registered.MSE casual.MSE
## rpart.MSEs   9523.749     8568.500       7096.629   661.3148
## forest.MSEs  3798.602     3515.571       2619.906   379.7289
## svm.MSEs     7511.831     7655.972       5854.001   647.9083
## neural.MSEs  4518.081     7194.954       3498.999  2302.9344
## gbm.MSEs     3556.995     3369.337       2636.727   343.1621
## lm.MSEs     10738.163    10684.486       7763.378   913.7805

As shown above, the random forest gives the best prediction (i.e. the lowest MSE) for registered users, while GBM gives the best result for casual users, as well as for cnt and for the combined prediction. Notice that the predicted results of random forest and GBM contain unwanted negative values, so we have to manually convert them to 0.
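
The same reading of the table can be done programmatically. The sketch below retypes the MSE values from the comparison table into a matrix (the shortened row names are our own choice) and picks the model with the lowest MSE in each column. It also takes the square root, since RMSE is in the same unit as the rental counts and is easier to interpret.

```r
# MSE values copied from the comparison table above
Summary.MSE <- matrix(
  c( 9523.749,  8568.500, 7096.629,  661.3148,
     3798.602,  3515.571, 2619.906,  379.7289,
     7511.831,  7655.972, 5854.001,  647.9083,
     4518.081,  7194.954, 3498.999, 2302.9344,
     3556.995,  3369.337, 2636.727,  343.1621,
    10738.163, 10684.486, 7763.378,  913.7805),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("rpart", "forest", "svm", "neural", "gbm", "lm"),
    c("cnt.MSE", "combined.MSE", "registered.MSE", "casual.MSE")))

# Row name of the smallest MSE in each column
best.model <- rownames(Summary.MSE)[apply(Summary.MSE, 2, which.min)]
names(best.model) <- colnames(Summary.MSE)
best.model

# RMSE: typical error expressed in rentals per hour
round(sqrt(Summary.MSE), 1)
```

With these values, GBM comes out best for cnt, combined and casual, and the random forest for registered, which matches the reading of the table above.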

This transformation is crucial because the final output we would like to generate is the total number of rentals, which is derived from the predicted registered users (by RandomForest) and the predicted casual users (by GBM). Since the output of each model includes negative values, we applied the transformation before adding the predicted values. With this additional step we got the lowest MSE, so this combination is essentially our ideal model.
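
A minimal sketch of this clamp-then-add step is shown below. The vectors are toy values standing in for the test-set columns, and `clamp` and `mse` are hypothetical helper names introduced here for illustration; they are not part of the analysis above.

```r
# Toy stand-ins for the test-set columns (illustrative values only)
actual.cnt      <- c(16, 40, 110, 5, 230)
pred.registered <- c(12.4, 35.1, 95.0, -3.2, 180.6)  # e.g. from the random forest
pred.casual     <- c(-1.5, 6.3, 20.2, 1.1, 41.0)     # e.g. from GBM

# Replace negative predictions with 0; counts cannot be negative
clamp <- function(x) pmax(x, 0)

# Mean squared error between actual and predicted values
mse <- function(actual, predicted) mean((actual - predicted)^2)

# Clamp each component first, then add them up
hybrid <- clamp(pred.registered) + clamp(pred.casual)

mse(actual.cnt, hybrid)                          # with clamping
mse(actual.cnt, pred.registered + pred.casual)   # without clamping
```

For a single component, clamping can never increase the error as long as the actual counts are non-negative, because 0 is always at least as close to the actual value as a negative prediction. In this toy example the summed prediction improves as well.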


Explorations


Knowing the relative influence (from gbm) of the predictors on “cnt”, “registered” and “casual”, we attempted to visualize the story behind the scenes.

Important Variables (Relative Influence):
cnt: hr, yr, workingday, temp, mnth, weekday, ...
registered: hr, workingday, yr, mnth, weekday, ...
casual: hr, weekday, temp, workingday, ...

Because these factors are similar, we subjectively grouped them into 4 factor groups and plotted each group against “cnt” (Total users), “registered” (Registered users), and “casual” (Casual users).

Factor groups:
(1) hr
(2) yr, mnth
(3) workingday, weekday, holiday
(4) temp, atemp, hum

Given the number of attributes, we could not plot every pair of variables. Instead, we show the most significant relationships between the x’s (e.g. hr, yr, ...) and the y’s (cnt, registered, casual).

# Ratio of 2 types of users 
r.registered <- sum(hour.data$registered) / sum(hour.data$cnt)
r.casual <- (1- r.registered)
print(c("Registered %",r.registered ))
## [1] "Registered %"      "0.811698316173547"
print(c("Casual %", r.casual))
## [1] "Casual %"          "0.188301683826453"

Registered users account for the majority of rental usage. Because “hr” is the most significant factor, let’s see the initial plot between “hr” and “cnt”.

plot(x= hour.data$hr, y= hour.data$cnt)

From this plot, we can see an interesting pattern: peak periods. The cnt spikes during 7-9am and 5-7pm, which coincides with the times when people go to work in the morning and leave work in the late afternoon. Let’s look at finer-grained plots.


Plottings


Since hour is the most significant factor for both registered and casual users, let’s see how the cnt, registered and casual fluctuate with the hour.

Because such rush-hour patterns exist, we divided the hour factor into 5 segments to better visualize and understand how the number of rentals changes over the day for both registered users and casual users.

# Create daypart column, default to "Night" (8pm-midnight)
hour.data$daypart <- "Night"
# 0am-7am: "Early Morning"
hour.data$daypart[(hour.data$hr >= 0) & (hour.data$hr < 7)] <- "Early Morning"
# 7am-9am: "Peak Morning"
hour.data$daypart[(hour.data$hr >= 7) & (hour.data$hr < 9)] <- "Peak Morning"
# 9am-5pm: "Day"
hour.data$daypart[(hour.data$hr >= 9) & (hour.data$hr < 17)] <- "Day"
# 5pm-8pm: "Peak Evening" (hr < 20 keeps hours 17, 18 and 19)
hour.data$daypart[(hour.data$hr >= 17) & (hour.data$hr < 20)] <- "Peak Evening"

# Factorization
hour.data$daypart <- factor(hour.data$daypart)
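
The same segmentation can also be written in one step with `cut()`. This is a sketch under the assumption that hr takes integer values 0 to 23 (stored either as a number or as a factor); `to.daypart` is a hypothetical helper name introduced here, not something used elsewhere in this analysis.

```r
# Hypothetical helper: map hour of day (0-23) to the five dayparts.
# Intervals are right-closed: (-1,6], (6,8], (8,16], (16,19], (19,23]
to.daypart <- function(hr) {
  cut(as.numeric(as.character(hr)),   # also works if hr is a factor
      breaks = c(-1, 6, 8, 16, 19, 23),
      labels = c("Early Morning", "Peak Morning", "Day",
                 "Peak Evening", "Night"))
}

# Number of hours that fall into each daypart
table(to.daypart(0:23))
```

One design difference: `cut()` keeps the factor levels in chronological order, while `factor()` on a character column sorts them alphabetically, which affects the order of the plot legend.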

(1) Hour

# Count by hour
g.cnt.hr <- ggplot(hour.data, aes(x = hr, y = cnt))
g.cnt.hr + geom_point(aes(color = daypart)) + ggtitle("Total Rental by Hour")

The plot shows two peaks, during the morning and evening peak hours from 7am to 9am and from 5pm to 8pm. Let’s break it down into registered and casual users.

# Registered by hour
g.registered.hr <- ggplot(hour.data, aes(x = hr, y = registered))
g.registered.hr + geom_point(aes(color = daypart)) + ggtitle("Registered Rental by Hour")

The peak hours are even more obvious for registered users. Apparently, many registered users commute to work by rental bikes.

# Casual by hour 
g.casual.hr <- ggplot(hour.data, aes(x = hr, y = casual))
g.casual.hr + geom_point(aes(color = daypart)) + ggtitle("Casual Rental by Hour")

The peak hours have little impact on casual users. This implies that the people who commute by rental bike are mostly registered users. Also, many casual users tend to use the service in the afternoon, which may correlate with temperature or other weather factors (because it gets hotter from 11am onwards). We will examine this hypothesis, the relationship between temperature and casual users, later.

(2) Year & Month

# Monthly total rental fluctuation in two years 
year <- function(x) {
  y = 
    if (x == 0) 2011
    else 2012
  return (y)
} 
hour.data$year <- factor(sapply(hour.data$yr, year))
g.cnt.mnth <- ggplot(hour.data, aes(as.numeric(mnth), as.numeric(cnt), colour = as.factor(year)))

g.cnt.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Monthly Total Rental Over Two Years")

Ridership increased significantly in 2012. Furthermore, since 81% of rentals come from registered users, we assumed that the ridership of registered users went up drastically. Let’s evaluate our assumption as follows.

# Monthly registered rental fluctuation in two years 
g.registered.mnth <- ggplot(hour.data, aes(as.numeric(mnth), as.numeric(registered), colour = as.factor(year)))

g.registered.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Monthly Registered Rental Over Two Years")

Not only did the ridership of registered users increase significantly, there is also an interesting pattern. While the ridership of registered users increased steadily over the first 7 months, there appears to be a jump from August 2011. This might result from new policies or other environmental factors.

Notice that usage is generally lower in winter, which may be related to lower temperatures.

# Monthly casual rental fluctuation in two years
g.casual.mnth <- ggplot(hour.data, aes(as.numeric(mnth), as.numeric(casual), colour = as.factor(year)))

g.casual.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Monthly Casual Rental Over Two Years")

There are a lot more casual users in summer and fall than in spring and winter. This suggests that casual users are more affected by the environmental settings than registered users.

Recall the previous graph: the usage curve of registered users is flatter than that of casual users, because registered users who commute by bike ride regularly and are relatively insensitive to the month.

(3) Working Day, Weekday, Holiday

# Count by hour on workingday 
g.cnt.workday <- ggplot(hour.data, aes(x = hr, y = cnt, fill = as.factor(workingday)))
g.cnt.workday + geom_bar(stat = "identity", position="dodge") + ggtitle("Total Rental by Workingday")

1 denotes a working day, while 0 denotes a non-working day. On working days, the peak hours are very obvious, while on non-working days, many people use the rental bikes in the afternoon. One way to look at this is that on a non-working day, people may like to use the service for fun.

# Registered on workingday
g.registered.workday <- ggplot(hour.data, aes(x = hr, y = registered, fill = as.factor(workingday)))
g.registered.workday + geom_bar(stat = "identity", position="dodge") + ggtitle("Registered Rental by Workingday")

On working days, most registered users use rental bikes during peak periods, meaning that, again, most registered users are bike commuters. On non-working days, their usage fluctuates less. Let’s evaluate:

library(data.table)
## 
## Attaching package: 'data.table'
## The following object is masked _by_ '.GlobalEnv':
## 
##     year
# Subsetting 
sub.hour.data <- hour.data[,c("daypart", "cnt", "registered", "casual")]

# Create data table
dt <-as.data.table(sub.hour.data)

# Extract rows where daypart is "Peak Morning" or "Peak Evening"
dt.peak <- dt[daypart %in% c("Peak Morning", "Peak Evening")]

# 42.3% of registered rentals happen in peak hours
dt.peak[, sum(registered)] / dt[, sum(registered)]
## [1] 0.4230142

Notice that 81% of rentals are made by registered users, and 42.3% of those rentals fall in peak hours, indicating that the management needs to pay special attention to bike arrangement in peak hours.

# Casual on workingday
g.casual.workday <- ggplot(hour.data, aes(x = hr, y = casual, fill = as.factor(workingday)))
g.casual.workday + geom_bar(stat = "identity", position="dodge") + ggtitle("Casual Rental by Workingday")

Most rentals come from registered users (81%); the remaining 19% come from casual users. Unlike registered users, casual users show a similar hourly pattern on working days and non-working days. Notably, casual users tend to use the service on non-working days, especially in the daytime, when people are most active. Or maybe they couldn’t get enough bikes on working days due to their lower priority.

# Count on weekday
g.cnt.hr.byweekday <- ggplot(hour.data, aes(as.numeric(hr), as.numeric(cnt), colour = as.factor(weekday)))
g.cnt.hr.byweekday + geom_smooth(se = FALSE, method = "auto") + ggtitle("Total Rentals vs Hour (from Sunday to Saturday)")

Rentals from Monday to Friday fall into one pattern, while rentals on Saturday and Sunday fall into another. This pattern matches the result of the previous graph, reflecting that the management has to treat the arrangement of bikes on weekdays and weekends very differently.

Let’s evaluate our assumption:

# Check the proportion of registered & casual users on a working or non-working day
sub.hour.data <- hour.data[,c("workingday", "cnt", "registered", "casual")]
dt <-as.data.table(sub.hour.data)
dt.wd1 <- dt[workingday == 1]
dt.wd0 <- dt[workingday == 0]

# On a working day, 87% of users are registered 
dt.wd1[, sum(registered)] /dt.wd1[, sum(cnt)]
## [1] 0.8677004
# While on a non-working day, casual users account for 32% of total ridership
dt.wd0[, sum(casual)] /dt.wd0[, sum(cnt)]
## [1] 0.3166468

The business insight here is that, on a working day, the management should focus on providing the best service for registered users.

Conversely, on a non-working day, although registered users still outnumber casual users, the demand from casual users becomes important, especially from 12pm to 5pm. The proportion of casual users increases from 13% on a working day to 32% on a non-working day.

(4) Temp, ATemp, Humidity

First, a picture of how temperature changes over a one-year time frame.

# mnth vs temp
g.temp.mnth <- ggplot(hour.data, aes(as.numeric(mnth), temp))
g.temp.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Temperature Fluctuation in a Year")

As discussed above, we assumed that registered users are less affected by environmental settings. Let’s see:

# Registered on temp 
g.registered.temp <- ggplot(hour.data, aes(x = temp, y = registered))
g.registered.temp + geom_point()

Unsurprisingly, the number of rentals by registered users does not change much with temperature, which supports our assumption.

# Casual on temp
g.casual.temp <- ggplot(hour.data, aes(x = temp, y = casual))
g.casual.temp + geom_point()

Casual users are more sensitive to temperature than registered users. Usage is much higher between 20 and 30 degrees Celsius. Interestingly, casual ridership drops greatly when the temperature exceeds 30 degrees, perhaps because casual users find the heat unbearable.
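
Since temp is stored as a normalized value (divided by 41, as noted in the dataset description), reading Celsius ranges off the plot requires converting it back. The sketch below shows one way to do that and to average casual rentals per 5-degree band. The `toy` data frame is simulated and only mimics the column names, so its numbers do not reproduce the findings above.

```r
# Simulated stand-in for hour.data; only the column names match the dataset
set.seed(42)
toy <- data.frame(temp = round(runif(500, 0.02, 1), 2))
toy$casual <- rpois(500, lambda = 5 + 60 * toy$temp)

# Undo the normalization: temp was divided by 41 (max)
toy$temp.celsius <- toy$temp * 41

# Group into 5-degree bands: [0,5], (5,10], ..., (40,45]
toy$temp.band <- cut(toy$temp.celsius,
                     breaks = seq(0, 45, by = 5),
                     include.lowest = TRUE)

# Average casual rentals per temperature band
band.mean <- tapply(toy$casual, toy$temp.band, mean)
round(band.mean, 1)
```

Applied to the real data, a table like this would make it easy to check where casual usage peaks and whether it falls off above 30 degrees.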

# Registered/casual on feels-like temperature (atemp was divided by 50)
hour.data$raw.atemp <- hour.data$atemp*50
g.temp2 <- ggplot(hour.data, aes(x = raw.atemp, y = registered))
g.temp2 + geom_point()

g.temp2 <- ggplot(hour.data, aes(x = raw.atemp, y = casual))
g.temp2 + geom_point()

The atemp plot is similar to the temp plot.

# Registered/casual on humidity
g.registered.hum <- ggplot(hour.data, aes(x = hum, y = registered))
g.registered.hum + geom_point()

g.casual.hum <- ggplot(hour.data, aes(x = hum, y = casual))
g.casual.hum + geom_point()

Casual users are more sensitive to humidity than registered users, but casual usage is also fairly smooth except at extreme humidity (e.g. heavy rain). This indicates that biking activity is relatively insensitive to humidity.


Reflection


The plots are consistent with our assumption that hr, mnth, workingday, temp and hum are the factors most strongly related to cnt/registered/casual. We also have the following interesting findings:
(1) Most registered users commute to work by rental bike, while casual users do not.
(2) 2012 showed an increase in users over 2011, contributed mainly by registered users.
(3) The usage pattern by hour differs a lot between working days and non-working days.
(4) Casual users are more sensitive to weather conditions than registered users.

The bike sharing system should allocate bikes with these facts in mind.


License, Acknowledgement, and References


[1] Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.

@article{ year={2013}, issn={2192-6352}, journal={Progress in Artificial Intelligence}, doi={10.1007/s13748-013-0040-3}, title={Event labeling combining ensemble detectors and background knowledge}, url={http://dx.doi.org/10.1007/s13748-013-0040-3}, publisher={Springer Berlin Heidelberg}, keywords={Event labeling; Event detection; Ensemble learning; Background knowledge}, author={Fanaee-T, Hadi and Gama, Joao}, }

[2] https://rpubs.com/saitej09/bikesharing

[3] https://rpubs.com/yroy/bike

[4] http://brandonharris.io/kaggle-bike-sharing/